{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "作者的github：https://github.com/glemaitre/pyparis-2018-sklearn\n",
    "    \n",
    "    翻译和整理：光城，黄海广"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# A more advanced introduction to scikit-learn\n",
    "# scikit-learn的高级介绍"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will draw couple of plots during the tutorial. We activate matplotlib to show the plots inline in the notebook.\n",
    "\n",
    "在本节教程中将会绘制几个图形，于是我们激活matplotlib,使得在notebook中显示内联图。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why this tutorial?\n",
    "\n",
    "## 为什么要出这个教程？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`scikit-learn` provides state-of-the-art machine learning algorithms. \n",
    "These algorithms, however, cannot be directly used on raw data. Raw data needs to be preprocessed beforehand. Thus, besides machine learning algorithms, `scikit-learn` provides a set of preprocessing methods. Furthermore, `scikit-learn` provides connectors for pipelining these estimators (i.e., transformer, regressor, classifier, clusterer, etc.).\n",
    "\n",
    "In this tutorial, we will present the set of `scikit-learn` functionalities allowing for pipelining estimators, evaluating those pipelines, tuning those pipelines using hyper-parameters optimization, and creating complex preprocessing steps.\n",
    "\n",
    "`scikit-learn` 提供最先进的机器学习算法。 但是，这些算法不能直接用于原始数据。 原始数据需要事先进行预处理。 因此，除了机器学习算法之外，scikit-learn还提供了一套预处理方法。此外，`scikit-learn` 提供用于流水线化这些估计器的连接器(即转换器，回归器，分类器，聚类器等)。\n",
    "\n",
    "在本教程中,将介绍`scikit-learn` 函数集，允许流水线估计器、评估这些流水线、使用超参数优化调整这些流水线以及创建复杂的预处理步骤。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Basic use-case: train and test a classifier\n",
    "## 1.基本用例：训练和测试分类器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this first example, we will train and test a classifier on a dataset. We will use this example to recall the API of `scikit-learn`.\n",
    "\n",
    "对于第一个示例，我们将在数据集上训练和测试一个分类器。 我们将使用此示例来回忆`scikit-learn`的API。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use the `digits` dataset which is a dataset of hand-written digits.\n",
    "\n",
    "我们将使用`digits`数据集，这是一个手写数字的数据集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_digits\n",
    "\n",
    "X, y = load_digits(return_X_y=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each row in `X` contains the intensities of the 64 image pixels. For each sample in `X`, we get the ground-truth `y` indicating the digit written.\n",
    "\n",
    "`X`中的每行包含64个图像像素的强度。 对于`X`中的每个样本，我们得到表示所写数字对应的`y`。"
   ]
  },
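  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, we can confirm the shape of the data: 1797 samples, each a flattened 8x8 image.\n",
    "\n",
    "作为快速检查，我们可以确认数据的形状：1797个样本，每个样本是一幅展平的8x8图像。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 检查数据形状：每行64个像素值对应一幅8x8图像\n",
    "print(X.shape)  # (1797, 64)\n",
    "print(y.shape)  # (1797,)"
   ]
  },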
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The digit in the image is 0\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAPgAAAD8CAYAAABaQGkdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAABAlJREFUeJzt3dFNqmkUhtGPyTRAC1gCtgIlaAlagr1YwqEEacESpIR/KpiYSY579DlrXRNeAzz5b0z2btu2BTT99X//AcDXETiECRzCBA5hAocwgUOYwCFM4BAmcAj7+yvedLfbJf897nQ6je69vLyMbV0ul7Gt5+fnsa3b7Ta2NW3btt1nr/EEhzCBQ5jAIUzgECZwCBM4hAkcwgQOYQKHMIFDmMAhTOAQJnAIEziECRzCBA5hAocwgUOYwCFM4BAmcAgTOIQJHMIEDmEChzCBQ9iXnC6qmjwltNZah8NhbGu/349tfXx8jG2dz+exrbXWen19Hd37jCc4hAkcwgQOYQKHMIFDmMAhTOAQJnAIEziECRzCBA5hAocwgUOYwCFM4BAmcAgTOIQJHMIEDmEChzCBQ5jAIUzgECZwCBM4hAkcwn786aLj8Ti2NXlKaK217u7uxrbe39/Htn79+jW2Nfn7WMvpImCQwCFM4BAmcAgTOIQJHMIEDmEChzCBQ5jAIUzgECZwCBM4hAkcwgQOYQKHMIFDmMAhTOAQJnAIEziECRzCBA5hAocwgUOYwCFM4BD242+T7ff7sa3r9Tq2tdbsvbBJ05/jn8wTHMIEDmEChzCBQ5jAIUzgECZwCBM4hAkcwgQOYQKHMIFDmMAhTOAQJnAIEziECRzCBA5hAocwgUOYwCFM4BAmcAgTOIQJHMIEDmFOF/0Hl8tlbKts8ju73W5jW9+RJziECRzCBA5hAocwgUOYwCFM4BAmcAgTOIQJHMIEDmEChzCBQ5jAIUzgECZwCBM4hAkcwgQOYQKHMIFDmMAhTOAQJnAIEziECRzCfvzposnTNMfjcWxr2uQ5ocnP8fX1dWzrO/IEhzCBQ5jAIUzgECZwCBM4hAkcwgQOYQKHMIFDmMAhTOAQJnAIEziECRzCBA5hAocwgUOYwCFM4BAmcAgTOIQJHMIEDmEChzCBQ9hu27bf/6a73e9/039xOBymptbb29vY1lprPT4+jm2dTqexrcnv7P7+fmxr2rZtu89e4wkOYQKHMIFDmMAhTOAQJnAIEziECRzCBA5hAocwgUOYwCFM4BAmcAgTOIQJHMIEDmEChzCBQ5jAIUzgECZwCBM4hAkcwgQOYQKHMIFD2I+/TTbp4eFhdO/p6Wls63q9jm2dz+exrTK3yeAPJ3AIEziECRzCBA5hAocwgUOYwCFM4BAmcAgTOIQJHMIEDmEChzCBQ5jAIUzgECZwCBM4hAkcwgQOYQKHMIFDmMAhTOAQJnAI+5LTRcD34AkOYQKHMIFDmMAhTOAQJnAIEziECRzCBA5hAocwgUOYwCFM4BAmcAgTOIQJHMIEDmEChzCBQ5jAIUzgECZwCBM4hP0DVJVS9XOb5i4AAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x25630f906a0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.imshow(X[0].reshape(8, 8), cmap='gray');# 下面完成灰度图的绘制\n",
    "# 灰度显示图像\n",
    "plt.axis('off')# 关闭坐标轴\n",
    "print('The digit in the image is {}'.format(y[0]))# 格式化打印"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In machine learning, we should evaluate our model by training and testing it on distinct sets of data. `train_test_split` is a utility function to split the data into two independent sets. The `stratify` parameter enforces the classes distribution of the train and test datasets to be the same than the one of the entire dataset.\n",
    "\n",
    "在机器学习中，我们应该通过在不同的数据集上进行训练和测试来评估我们的模型。`train_test_split` 是一个用于将数据拆分为两个独立数据集的效用函数。`stratify`参数可强制将训练和测试数据集的类分布与整个数据集的类分布相同。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)\n",
    "\n",
    "# 划分数据为训练集与测试集,添加stratify参数，以使得训练和测试数据集的类分布与整个数据集的类分布相同。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have independent training and testing sets, we can learn a machine learning model using the `fit` method. We will use the `score` method to test this method, relying on the default accuracy metric.\n",
    "\n",
    "一旦我们拥有独立的培训和测试集，我们就可以使用`fit`方法学习机器学习模型。 我们将使用`score`方法来测试此方法，依赖于默认的准确度指标。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of the LogisticRegression is 0.96\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression# 求出Logistic回归的精确度得分\n",
    "\n",
    "clf = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=5000, random_state=42)\n",
    "clf.fit(X_train, y_train)\n",
    "accuracy = clf.score(X_test, y_test)\n",
    "print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The API of `scikit-learn` is consistent across classifiers. Thus, we can easily replace the `LogisticRegression` classifier by a `RandomForestClassifier`. These changes are minimal and only related to the creation of the classifier instance.\n",
    "\n",
    "`scikit-learn`的API在分类器中是一致的。因此，我们可以通过`RandomForestClassifier`轻松替换`LogisticRegression`分类器。这些更改很小，仅与分类器实例的创建有关。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of the RandomForestClassifier is 0.96\n"
     ]
    }
   ],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "# RandomForestClassifier轻松替换LogisticRegression分类器\n",
    "clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)\n",
    "clf.fit(X_train, y_train)\n",
    "accuracy = clf.score(X_test, y_test)\n",
    "print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Do the following exercise:\n",
    "\n",
    "#### 练习\n",
    "完成接下来的练习："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Load the breast cancer dataset. Import the functions `load_breast_cancer` from `sklearn.datasets`\n",
    "* 加载乳腺癌数据集。从`sklearn.datasets`导入函数`load_breast_cancer`。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/01_1_solutions.py\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "\n",
    "X_breast, y_breast = load_breast_cancer(return_X_y=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Split the dataset to keep 30% of it for testing using `sklearn.model_selection.train_test_split`. Make sure to stratify the data (i.e., use the `stratify` parameter) and set the `random_state` to `0`.\n",
    "* 使用`sklearn.model_selection.train_test_split`拆分数据集并保留30％的数据集以进行测试。确保对数据进行分层（即使用 `stratify`参数）并将`random_state`设置为`0`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/01_2_solutions.py\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_breast_train, X_breast_test, y_breast_train, y_breast_test = train_test_split(X_breast, y_breast, stratify=y_breast, random_state=0, test_size=0.3)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Train a supervised classifier using the training data.\n",
    "* 使用训练数据训练监督分类器。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GradientBoostingClassifier(criterion='friedman_mse', init=None,\n",
       "              learning_rate=0.1, loss='deviance', max_depth=3,\n",
       "              max_features=None, max_leaf_nodes=None,\n",
       "              min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "              min_samples_leaf=1, min_samples_split=2,\n",
       "              min_weight_fraction_leaf=0.0, n_estimators=100,\n",
       "              n_iter_no_change=None, presort='auto', random_state=0,\n",
       "              subsample=1.0, tol=0.0001, validation_fraction=0.1,\n",
       "              verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# %load solutions/01_3_solutions.py\n",
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "\n",
    "clf = GradientBoostingClassifier(n_estimators=100, random_state=0)\n",
    "clf.fit(X_breast_train, y_breast_train)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Use the fitted classifier to predict the classification labels for the testing set.\n",
    "* 使用拟合分类器预测测试集的分类标签。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/01_4_solutions.py\n",
    "y_pred = clf.predict(X_breast_test)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Compute the balanced accuracy on the testing set. You need to import `balanced_accuracy_score` from `sklearn.metrics`\n",
    "* 计算测试集的`balanced_accuracy_score`精度。您需要从`sklearn.metrics`导入`balanced_accuracy_score`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of the GradientBoostingClassifier is 0.94\n"
     ]
    }
   ],
   "source": [
    "# %load solutions/01_5_solutions.py\n",
    "from sklearn.metrics import balanced_accuracy_score\n",
    "\n",
    "accuracy = balanced_accuracy_score(y_breast_test, y_pred)\n",
    "print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. More advanced use-case: preprocess the data before training and testing a classifier\n",
    "## 2.更高级的用例：在训练和测试分类器之前预处理数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Standardize your data\n",
    "### 2.1 标准化您的数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Preprocessing might be required before learning a model. For instance, a user could be interested in creating hand-crafted features or an algorithm might make some apriori assumptions about the data. \n",
    "\n",
    "In our case, the solver used by the `LogisticRegression` expects the data to be normalized. Thus, we need to standardize the data before training the model. To observe this necessary condition, we will check the number of iterations required to train the model.\n",
    "\n",
    "在学习模型之前可能需要预处理。例如，一个用户可能对创建手工制作的特征或者算法感兴趣，那么他可能会对数据进行一些先验假设。\n",
    "\n",
    "在我们的例子中，`LogisticRegression`使用的求解器期望数据被规范化。因此，我们需要在训练模型之前标准化数据。为了观察这个必要条件，我们将检查训练模型所需的迭代次数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LogisticRegression required 2152 iterations to be fitted\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=5000, random_state=42)\n",
    "clf.fit(X_train, y_train)\n",
    "print('{} required {} iterations to be fitted'.format(clf.__class__.__name__, clf.n_iter_[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `MinMaxScaler` transformer is used to normalise the data. This scaler should be applied in the following way: learn (i.e., `fit` method) the statistics on a training set and standardize (i.e., `transform` method) both the training and testing sets. Finally, we will train and test the model and the scaled datasets.\n",
    "\n",
    "`MinMaxScaler`变换器用于规范化数据。该标量应该以下列方式应用：学习（即，`fit`方法）训练集上的统计数据并标准化（即，`transform`方法）训练集和测试集。 最后，我们将训练和测试这个模型并得到归一化后的数据集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of the LogisticRegression is 0.96\n",
      "LogisticRegression required 177 iterations to be fitted\n"
     ]
    }
   ],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "scaler = MinMaxScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)\n",
    "clf.fit(X_train_scaled, y_train)\n",
    "accuracy = clf.score(X_test_scaled, y_test)\n",
    "print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))\n",
    "print('{} required {} iterations to be fitted'.format(clf.__class__.__name__, clf.n_iter_[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By scaling the data, the convergence of the model happened much faster than with the unscaled data.\n",
    "\n",
    "通过归一化数据，模型的收敛速度要比未归一化的数据快得多。(迭代次数变少了)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 The wrong preprocessing patterns\n",
    "### 2.2 错误的预处理模式\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We highlighted how to preprocess and adequately train a machine learning model. It is also interesting to spot what would be the wrong way of preprocessing data. There are two potential mistakes which are easy to make but easy to spot.\n",
    "\n",
    "我们强调了如何预处理和充分训练机器学习模型。发现预处理数据的错误方法也很有趣。其中有两个潜在的错误，易于犯错但又很容易发现。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first pattern is to standardize the data before spliting the full set into training and testing sets.\n",
    "\n",
    "\n",
    "第一种模式是在整个数据集分成训练和测试集之前标准化数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of the LogisticRegression is 0.96\n"
     ]
    }
   ],
   "source": [
    "scaler = MinMaxScaler()\n",
    "X_scaled = scaler.fit_transform(X)\n",
    "X_train_prescaled, X_test_prescaled, y_train_prescaled, y_test_prescaled = train_test_split(\n",
    "    X_scaled, y, stratify=y, random_state=42)\n",
    "\n",
    "clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)\n",
    "clf.fit(X_train_prescaled, y_train_prescaled)\n",
    "accuracy = clf.score(X_test_prescaled, y_test_prescaled)\n",
    "print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The second pattern is to standardize the training and testing sets independently. It comes back to call the `fit` methods on both training and testing sets. Thus, the training and testing sets are standardized differently.\n",
    "\n",
    "第二种模式是独立地标准化训练和测试集。它回来在训练和测试集上调用`fit`方法。因此，训练和测试集的标准化不同。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of the LogisticRegression is 0.96\n"
     ]
    }
   ],
   "source": [
    "scaler = MinMaxScaler()\n",
    "X_train_prescaled = scaler.fit_transform(X_train)\n",
    "X_test_prescaled = scaler.fit_transform(X_test)\n",
    "\n",
    "clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)\n",
    "clf.fit(X_train_prescaled, y_train)\n",
    "accuracy = clf.score(X_test_prescaled, y_test)\n",
    "print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Keep it simple, stupid: use the pipeline connector from `scikit-learn`\n",
    "### 2.3 保持简单，笨的方法：使用`scikit-learn`的管道连接器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two previous patterns are an issue with data leaking. However, this is difficult to prevent such a mistake when one has to do the preprocessing by hand. Thus, `scikit-learn` introduced the `Pipeline` object. It sequentially connects several transformers and a classifier (or a regressor). We can create a pipeline as:\n",
    "\n",
    "前面提到的两个模式是数据泄漏的问题。然而，当必须手动进行预处理时，很难防止这种错误。因此,`scikit-learn`引入了`Pipeline`对象。它依次连接多个转换器和分类器（或回归器）。我们可以创建一个如下管道："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "pipe = Pipeline(steps=[('scaler', MinMaxScaler()),\n",
    "                       ('clf', LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42))])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that this pipeline contains the parameters of both the scaler and the classifier. Sometimes, it can be tedious to give a name to each estimator in the pipeline. `make_pipeline` will give a name automatically to each estimator which is the lower case of the class name.\n",
    "\n",
    "我们看到这个管道包含了缩放器(归一化)和分类器的参数。 有时，为管道中的每个估计器命名可能会很繁琐。 而`make_pipeline`将自动为每个估计器命名，这是类名的小写。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "pipe = make_pipeline(MinMaxScaler(),\n",
    "                     LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42, max_iter=1000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pipeline will have an identical API. We use `fit` to train the classifier and `score` to check the accuracy. However, calling `fit` will call the method `fit_transform` of all transformers in the pipeline. Calling `score` (or `predict` and `predict_proba`) will call internally `transform` of all transformers in the pipeline. It corresponds to the normalization procedure in Sect. 2.1.\n",
    "\n",
    "管道将具有相同的API。 我们使用`fit`来训练分类器和`socre`来检查准确性。 然而，调用`fit`会调用管道中所有变换器的`fit_transform`方法。 调用`score`（或`predict`和`predict_proba`）将调用管道中所有转换器的内部变换。 它对应于本文2.1中的规范化过程。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of the Pipeline is 0.96\n"
     ]
    }
   ],
   "source": [
    "pipe.fit(X_train, y_train)\n",
    "accuracy = pipe.score(X_test, y_test)\n",
    "print('Accuracy score of the {} is {:.2f}'.format(pipe.__class__.__name__, accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can check all the parameters of the pipeline using `get_params()`.\n",
    "\n",
    "我们可以使用`get_params()`检查管道的所有参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "           intercept_scaling=1, max_iter=1000, multi_class='auto',\n",
       "           n_jobs=None, penalty='l2', random_state=42, solver='lbfgs',\n",
       "           tol=0.0001, verbose=0, warm_start=False),\n",
       " 'logisticregression__C': 1.0,\n",
       " 'logisticregression__class_weight': None,\n",
       " 'logisticregression__dual': False,\n",
       " 'logisticregression__fit_intercept': True,\n",
       " 'logisticregression__intercept_scaling': 1,\n",
       " 'logisticregression__max_iter': 1000,\n",
       " 'logisticregression__multi_class': 'auto',\n",
       " 'logisticregression__n_jobs': None,\n",
       " 'logisticregression__penalty': 'l2',\n",
       " 'logisticregression__random_state': 42,\n",
       " 'logisticregression__solver': 'lbfgs',\n",
       " 'logisticregression__tol': 0.0001,\n",
       " 'logisticregression__verbose': 0,\n",
       " 'logisticregression__warm_start': False,\n",
       " 'memory': None,\n",
       " 'minmaxscaler': MinMaxScaler(copy=True, feature_range=(0, 1)),\n",
       " 'minmaxscaler__copy': True,\n",
       " 'minmaxscaler__feature_range': (0, 1),\n",
       " 'steps': [('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),\n",
       "  ('logisticregression',\n",
       "   LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "             intercept_scaling=1, max_iter=1000, multi_class='auto',\n",
       "             n_jobs=None, penalty='l2', random_state=42, solver='lbfgs',\n",
       "             tol=0.0001, verbose=0, warm_start=False))]}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe.get_params()"
   ]
  },
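  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These `<step>__<parameter>` names can also be passed to `set_params` to change a parameter of any step in the pipeline. A minimal sketch, assuming the `pipe` created with `make_pipeline` above:\n",
    "\n",
    "这些`<步骤名>__<参数名>`形式的名称也可以传给`set_params`，用来修改管道中任意步骤的参数。下面是一个简单示例（假设使用上面由`make_pipeline`创建的`pipe`）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 使用 <步骤名>__<参数名> 约定修改管道中某一步的参数\n",
    "pipe.set_params(logisticregression__C=10.0)\n",
    "print(pipe.named_steps['logisticregression'].C)  # 10.0"
   ]
  },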
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Reuse the breast dataset of the first exercise to train a `SGDClassifier` which you can import from `linear_model`. Make a pipeline with this classifier and a `StandardScaler` transformer imported from `sklearn.preprocessing`. Train and test this pipeline.\n",
    "\n",
    "### 练习\n",
    "重用第一个练习的乳腺癌数据集来训练,可以从`linear_model`导入`SGDClassifier`。 使用此分类器和从`sklearn.preprocessing`导入的`StandardScaler`变换器来创建管道。然后训练和测试这条管道。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of the Pipeline is 0.95\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\stochastic_gradient.py:183: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.\n",
      "  FutureWarning)\n"
     ]
    }
   ],
   "source": [
    "# %load solutions/02_solutions.py\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import SGDClassifier\n",
    "\n",
    "pipe = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000))\n",
    "pipe.fit(X_breast_train, y_breast_train)\n",
    "y_pred = pipe.predict(X_breast_test)\n",
    "accuracy = balanced_accuracy_score(y_breast_test, y_pred)\n",
    "print('Accuracy score of the {} is {:.2f}'.format(pipe.__class__.__name__, accuracy))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. When more is better than less: cross-validation instead of single split\n",
    "### 3.当更多优于更少时：交叉验证而不是单独拆分"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Splitting the data is necessary to evaluate the statistical model performance. However, it reduces the number of samples which can be used to learn the model. Therefore, one should use cross-validation whenever possible. Having multiple splits will give information about the model stability as well. \n",
    "\n",
    "`scikit-learn` provides three functions: `cross_val_score`, `cross_val_predict`, and [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html). The latter provides more information regarding fitting time, training and testing scores. I can also return multiple scores at once.\n",
    "\n",
    "分割数据对于评估统计模型性能是必要的。 但是，它减少了可用于学习模型的样本数量。 因此，应尽可能使用交叉验证。有多个拆分也会提供有关模型稳定性的信息。\n",
    "\n",
    "`scikit-learn`提供了三个函数：`cross_val_score`，`cross_val_predict`和`cross_validate`。 后者提供了有关拟合时间，训练和测试分数的更多信息。 我也可以一次返回多个分数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "pipe = make_pipeline(MinMaxScaler(),\n",
    "                     LogisticRegression(solver='lbfgs', multi_class='auto',\n",
    "                                        max_iter=1000, random_state=42))\n",
    "scores = cross_validate(pipe, X, y, cv=3, return_train_score=True)"
   ]
  },
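  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, `cross_val_score` returns only the array of test scores, one per fold. A minimal sketch using the same pipeline (stored in a separate variable so the `scores` dict above is kept intact):\n",
    "\n",
    "作为对比，`cross_val_score`只返回测试得分数组，每折一个。下面是使用同一管道的简单示例（结果存入另一个变量，以免覆盖上面的`scores`）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "# 只返回每折的测试得分，不包含拟合时间和训练得分\n",
    "test_scores = cross_val_score(pipe, X, y, cv=3)\n",
    "print(test_scores.shape)  # (3,)"
   ]
  },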
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the cross-validate function, we can quickly check the training and testing scores and make a quick plot using `pandas`.\n",
    "\n",
    "使用交叉验证函数，我们可以快速检查训练和测试分数，并使用`pandas`快速绘图。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fit_time</th>\n",
       "      <th>score_time</th>\n",
       "      <th>test_score</th>\n",
       "      <th>train_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.154587</td>\n",
       "      <td>0.000997</td>\n",
       "      <td>0.925249</td>\n",
       "      <td>0.988285</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.132646</td>\n",
       "      <td>0.000997</td>\n",
       "      <td>0.943239</td>\n",
       "      <td>0.984975</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.127659</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.924497</td>\n",
       "      <td>0.993339</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   fit_time  score_time  test_score  train_score\n",
       "0  0.154587    0.000997    0.925249     0.988285\n",
       "1  0.132646    0.000997    0.943239     0.984975\n",
       "2  0.127659    0.000000    0.924497     0.993339"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df_scores = pd.DataFrame(scores)\n",
    "df_scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x25635300550>"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD9CAYAAABQvqc9AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFXVJREFUeJzt3X+wXOV93/H3x+KHiaVggsitLRGJJqRFNTKxb7Ad1/aNm7gQOmBQxkBqd5h0Ru04jCeZIY6YtjhWy4BrktYpZCbylBiauJRREoYY2UAVLXRqxwZqJBAaURkTI+S6jusovoADot/+sUfusrrS3ftDuhc979fMjs55nuec8+zR2c+e+5w9u6kqJElteM1Cd0CSdOwY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGnLDQHRi2fPnyWr169UJ347jx3HPP8brXvW6huyFNyeNz/jzyyCN/WVVnTNdu0YX+6tWrefjhhxe6G8eNXq/HxMTEQndDmpLH5/xJ8hejtHN4R5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktSQRXdzlmYnyYyX8feRpfZ4pn+cqKopH6t+43OHrZPUHkNfkhpi6EtSQwx9SWqIoS9JDRkp9JNckGR3kj1JNkxRvyrJ1iQ7kvSSrByo+0SSx7vH5fPZeUnSzEwb+kmWALcAFwJrgCuTrBlqdhNwe1WtBTYCN3TLXgS8BTgPeBvw60l+eP66L0maiVE+p38+sKeqngJIcgdwCfDEQJs1wK9109uAuwbKH6iqA8CBJNuBC4A756HvTXrzx+9j/wsvzWiZ1RvumVH7U085ke0fe9+MlpH06jBK6K8AnhmY30v/rH3QdmAd8CngUmBZktO78o8l+W3gh4Cf5ZVvFpqh/S+8xNM3XjRy+9n8MtFM3yQkvXqMEvpT3eo5fGfPNcDNSa4CHgSeBQ5U1X1Jfhr4IvBt4EvAgUM2kKwH1gOMjY3R6/VG7X+TZrJ/JicnZ7U//T/QsTDb41OzN0ro7wXOHJhfCewbbFBV+4DLAJIsBdZV1f6u7nrg+q7us8D/HN5AVW0CNgGMj4+Xv5l5BF+4Z0Zn7rP6DdIZbkOaLX8j99gb5dM7DwFnJzkryUnAFcDdgw2SLE9ycF3XArd25Uu6YR6SrAXWAvfNV+clSTMz7Zl+VR1IcjVwL7AEuLWqdibZCDxcVXcDE8ANSYr+8M6vdIufCPy37svA/hr4YHdRV5K0AEb6ls2q2gJsGSq7bmB6M7B5iuW+T/8TPJKkRcCvVn6VWXbOBs697ZD7447stpluA2D0TwhJevUw9F9lvrfrRj+yKWnW/O4dSWqIoS9JDTH0Jakhjum/Cs14zP0LM//uHUnHJ0P/VWYmF3Gh/wYx02UkHb8c3pGkhhj6ktQQh3eOE91XXUxd94mpy6uGvyxV0vHOM/3jRFVN+di2bdth6yS1x9CXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGjBT6SS5IsjvJniSH/EBrklVJtibZkaSXZOVA3b9NsjPJriS/kyN9X4Ak6aiaNvSTLAFuAS4E1gBXJlkz1Owm4PaqWgtsBG7olv0Z4J3AWuBNwE8D75m33kuSZmSUM/3zgT1V9VRVvQjcAVwy1GYNsLWb3jZQX8BrgZOAk4ETgW/NtdOSpNkZJfRXAM8MzO/tygZtB9Z105cCy5KcXlVfov8m8M3ucW9V7ZpblyVJszXKVytPNQY//BWN1wA3J7kKeBB4FjiQ5CeAc4CDY/z3J3l3VT34ig0k64H1AGNjY/R6vZGfgI5scnLS/alFy+Pz2Bsl9PcCZw7MrwT2DTaoqn3AZQBJlgLrqmp/F+Z/XlWTXd3ngbfTf2MYXH4TsAlgfHy8JiYmZvVkdKher4f7U4uVx+exN8rwzkPA2UnOSnIS
cAVw92CDJMuTHFzXtcCt3fQ3gPckOSHJifQv4jq8I0kLZNrQr6oDwNXAvfQD+86q2plkY5KLu2YTwO4kTwJjwPVd+Wbga8Bj9Mf9t1fVn87vU5AkjWqkn0usqi3AlqGy6wamN9MP+OHlXgb+2Rz7KEmaJ96RK0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0JekhowU+kkuSLI7yZ4kG6aoX5Vka5IdSXpJVnblP5vk0YHH95O8f76fhCRpNNOGfpIlwC3AhcAa4Moka4aa3QTcXlVrgY3ADQBVta2qzquq84D3As8D981j/yVJMzDKmf75wJ6qeqqqXgTuAC4ZarMG2NpNb5uiHuAXgc9X1fOz7awkaW5OGKHNCuCZgfm9wNuG2mwH1gGfAi4FliU5vaq+M9DmCuC3p9pAkvXAeoCxsTF6vd5Indf0Jicn3Z9atDw+j71RQj9TlNXQ/DXAzUmuAh4EngUO/GAFyRuAc4F7p9pAVW0CNgGMj4/XxMTECN3SKHq9Hu5PLVYen8feKKG/FzhzYH4lsG+wQVXtAy4DSLIUWFdV+weafAD4k6p6aW7dlSTNxShj+g8BZyc5K8lJ9Idp7h5skGR5koPruha4dWgdVwL/ea6dlSTNzbShX1UHgKvpD83sAu6sqp1JNia5uGs2AexO8iQwBlx/cPkkq+n/pfDAvPZckjRjowzvUFVbgC1DZdcNTG8GNh9m2afpXwyWJC0w78iVpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGjJS6Ce5IMnuJHuSbJiiflWSrUl2JOklWTlQ92NJ7kuyK8kT3Q+lS5IWwLShn2QJcAtwIbAGuDLJmqFmNwG3V9VaYCNww0Dd7cAnq+oc4Hzgf89HxyVJMzfKmf75wJ6qeqqqXgTuAC4ZarMG2NpNbztY3705nFBV9wNU1WRVPT8vPZckzdgoob8CeGZgfm9XNmg7sK6bvhRYluR04CeBv0ryx0m+muST3V8OkqQFcMIIbTJFWQ3NXwPcnOQq4EHgWeBAt/53AT8FfAP4L8BVwH98xQaS9cB6gLGxMXq93qj91zQmJyfdn1q0PD6PvVFCfy9w5sD8SmDfYIOq2gdcBpBkKbCuqvYn2Qt8taqe6uruAt7OUOhX1SZgE8D4+HhNTEzM6snoUL1eD/enFiuPz2NvlOGdh4Czk5yV5CTgCuDuwQZJlic5uK5rgVsHlj0tyRnd/HuBJ+bebUnSbEwb+lV1ALgauBfYBdxZVTuTbExycddsAtid5ElgDLi+W/Zl+kM/W5M8Rn+o6NPz/iwkSSMZZXiHqtoCbBkqu25gejOw+TDL3g+snUMfJUnzxDtyJakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUkJFCP8kFSXYn2ZNkwxT1q5JsTbIjSS/JyoG6l5M82j3uns/OS5Jm5oTpGiRZAtwC/DywF3goyd1V9cRAs5uA26vqtiTvBW4APtTVvVBV581zvyVJszDKmf75wJ6qeqqqXgTuAC4ZarMG2NpNb5uiXpK0CIwS+iuAZwbm93Zlg7YD67rpS4FlSU7v5l+b5OEkf57k/XPqrSRpTqYd3gEyRVkNzV8D3JzkKuBB4FngQFf3Y1W1L8nfBv4syWNV9bVXbCBZD6wHGBsbo9frjf4MdESTk5PuTy1aHp/H3iihvxc4c2B+JbBvsEFV7QMuA0iyFFhXVfsH6qiqp5L0gJ8Cvja0/CZgE8D4+HhNTEzM4qloKr1eD/en
FiuPz2NvlOGdh4Czk5yV5CTgCuAVn8JJsjzJwXVdC9zalZ+W5OSDbYB3AoMXgCVJx9C0oV9VB4CrgXuBXcCdVbUzycYkF3fNJoDdSZ4ExoDru/JzgIeTbKd/gffGoU/9SJKOoVGGd6iqLcCWobLrBqY3A5unWO6LwLlz7KMkaZ54R64kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JashIX60sSbOVTPWLq9OrGv5VVs0Hz/QlHVVVddjHqt/43GHrdHQY+pLUEENfkhpi6EtSQwx9SWrISKGf5IIku5PsSbJhivpVSbYm2ZGkl2TlUP0PJ3k2yc3z1XFJ0sxNG/pJlgC3ABcCa4Ark6wZanYTcHtVrQU2AjcM1f9r4IG5d1eSNBejnOmfD+ypqqeq6kXgDuCSoTZrgK3d9LbB+iRvBcaA++beXUnSXIxyc9YK4JmB+b3A24babAfWAZ8CLgWWJTkd+C7wW8CHgH9wuA0kWQ+sBxgbG6PX643YfU1ncnLS/alFzePz2Bol9Ke6nW74zolrgJuTXAU8CDwLHAA+DGypqmeOdFdeVW0CNgGMj4/XxMTECN3SKHq9Hu5PLVpfuMfj8xgbJfT3AmcOzK8E9g02qKp9wGUASZYC66pqf5J3AO9K8mFgKXBSksmqOuRisCTp6Bsl9B8Czk5yFv0z+CuAXxpskGQ58H+q6v8C1wK3AlTVPx5ocxUwbuBL0sKZ9kJuVR0ArgbuBXYBd1bVziQbk1zcNZsAdid5kv5F2+uPUn8lSXMw0rdsVtUWYMtQ2XUD05uBzdOs4zPAZ2bcQ0nSvPGOXElqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSEjffeOJE3nzR+/j/0vvDTj5VZvuGfktqeeciLbP/a+GW9D/5+hL2le7H/hJZ6+8aIZLTPTH/mZyRuEpubwjiQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWrISKGf5IIku5PsSbJhivpVSbYm2ZGkl2TlQPkjSR5NsjPJP5/vJyBJGt20oZ9kCXALcCGwBrgyyZqhZjcBt1fVWmAjcENX/k3gZ6rqPOBtwIYkb5yvzkuSZmaUM/3zgT1V9VRVvQjcAVwy1GYNsLWb3nawvqperKq/6cpPHnF7kqSjZJQQXgE8MzC/tysbtB1Y101fCixLcjpAkjOT7OjW8Ymq2je3LkuSZmuUr2HIFGU1NH8NcHOSq4AHgWeBAwBV9QywthvWuSvJ5qr61is2kKwH1gOMjY3R6/Vm8hx0BJOTk+5PHTMzPdZmc3x6PM/NKKG/FzhzYH4l8Iqz9e7s/TKAJEuBdVW1f7hNkp3Au4DNQ3WbgE0A4+PjNZPv4tCRzfS7TaRZ+8I9Mz7WZnx8zmIbeqVRhnceAs5OclaSk4ArgLsHGyRZnuTguq4Fbu3KVyY5pZs+DXgnsHu+Oi9JmplpQ7+qDgBXA/cCu4A7q2pnko1JLu6aTQC7kzwJjAHXd+XnAF9Osh14ALipqh6b5+cgSRrRSF+tXFVbgC1DZdcNTG9maMimK78fWDvHPkqS5okfoZSkhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIaMdEeuJE1n2TkbOPe2Q35Yb3q3zWQbABfNfBv6AUNf0rz43q4bj/o2Tj3lxKO+jeOdoS9pXjx949Rn4MlUP8kxvarhn+3QfHBMX9JRVVWHfWzbtu2wdTo6DH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQ7LYboJI8m3gLxa6H8eR5cBfLnQnpMPw+Jw/q6rqjOkaLbrQ1/xK8nBVjS90P6SpeHweew7vSFJDDH1Jaoihf/zbtNAdkI7A4/MYc0xfkhrimb4kNcTQl3SIJK9P8uFZLvurSX5ovvuk+WHoL4DZvqCS
bEny+qPRJ2nI64FZhT7wq8AxC/0kS47Vto4Hhv7CmPIFNd3BW1W/UFV/ddR6NSJfZE24EfjxJI8m+WSSX0/yUJIdST4OkOR1Se5Jsj3J40kuT/IR4I3AtiTbplpxkiVJPtMt81iSX+vKfyLJf+3W9z+S/Hj6PjnQ9vKu7USSbUk+CzzWlX0wyVe6Pv+ex+lhHOmnzHwcnQdwB/AC8CjwELAN+CzwRFd/F/AIsBNYP7Dc0/TvYFwN7AI+3bW5DzjlCNv7CPAEsAO4oytbCvw+/RfMDmBdV35lV/Y48ImBdUwCG4EvA38feCvwQNfPe4E3LPR+9TGvx+hq4PFu+n30P2UT+ieKnwPeDawDPj2wzKndv08Dy4+w7rcC9w/Mv77798vApd30a+n/tbAOuB9YAowB3wDeAEwAzwFnde3PAf4UOLGb/13gnyz0flyMjwXvQIuPoRfUKw7eruxHun9P6cL39G5+MPQPAOd15XcCHzzC9vYBJ3fTB19gnwD+/UCb0+ifoX0DOAM4Afgz4P1dfQEf6KZPBL4InNHNXw7cutD71cdRO0Zv6o69R7vHHuCfAj8JfL07lt41sOx0oX8a8DXgPwAXdG8ky4C9U7T9d8AvD8z/J+Di7nWzbaD86u44P9jH3cBvLvR+XIyPE9Bi8JWq+vrA/EeSXNpNnwmcDXxnaJmvV9Wj3fQj9F+kh7MD+MMkd9H/KwLg54ArDjaoqu8meTfQq6pvAyT5Q/pndHcBLwN/1DX/O8CbgPuTQP8s7JujPVW9CgW4oap+75CK5K3ALwA3JLmvqjZOt7LuWHsz8A+BXwE+QP86wOG2fTjPDbW7raqunW77rXNMf3H4wcGbZIJ+IL+jqt4MfJX+n7rD/mZg+mU44hv4RcAt9P+sfiTJCfRfJMM3aRzpBfb9qnp5oN3Oqjqve5xbVe87wrJ69fke/bNv6A/f/XKSpQBJViT50SRvBJ6vqj+g/9fAW6ZY9hBJlgOvqao/Av4V8Jaq+mtgb5L3d21O7j4B9CBweXcd4Az6JyFfmWK1W4FfTPKj3fI/kmTVXHbA8crQXxhHelGcCny3qp5P8neBt89lQ0leA5xZVduAj9K/iLyU/nWAqwfanUZ/TPU9SZZ3F8GupD9uP2w3cEaSd3TLnpjk782ln1pcquo7wH9P8jjw8/SvOX0pyWPAZvrH77nAV5I8CvwL4N90i28CPn+4C7nACqDXLfcZ4ODZ+Yfo/5W7g/7w4d8C/oT+X6rb6Q83frSq/tcU/X0C+JfAfd3y99Mf+9cQ78hdIN2nDtbSv6D7rar6R135yfSHU1bQhSv9sclekqeBcfqh/bmqelO3zDXA0qr6zSm2cyL9C8Wn0j9D/4OqurE7azt49v8y8PGq+uMkv0T/RRhgS1V9tFvPZFUtHVjvecDvdOs9gf71gU/P4y6SdBQY+pLUEC/kSjpqknwZOHmo+ENV9dhC9Eee6R9XktwCvHOo+FNV9fsL0R9Ji4+hL0kN8dM7ktQQQ1+SGmLoS1JDDH1JaoihL0kN+X9kxn8Z+kPMtgAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x25635298eb8>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df_scores[['train_score', 'test_score']].boxplot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Use the pipeline from the previous exercise and perform a cross-validation instead of a single-split evaluation.\n",
    "\n",
    "#### 练习\n",
    "使用上一个练习的管道并进行交叉验证，而不是单个拆分评估。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\stochastic_gradient.py:183: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.\n",
      "  FutureWarning)\n",
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\stochastic_gradient.py:183: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.\n",
      "  FutureWarning)\n",
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\stochastic_gradient.py:183: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.\n",
      "  FutureWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x2563538fd68>"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD9CAYAAABQvqc9AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAGBlJREFUeJzt3X9wVed95/H3JxgcNxDbAUd1EAW2pa2VmNBYxUmzSa4zWxfijglWJjbZ/GC7M9qdhMm0MyQrZlunUZcBN3Q2ycK2VWapTduUurT10KAYiKJrdppf2K0BY0aOQpwglCZpnNLIdmLL+faPc0gPlyvpXOkiCT+f18wdznnOc8557p3nfjh6zrnnKCIwM7M0vGSmG2BmZtPHoW9mlhCHvplZQhz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJceibmSXkipluQK1FixbFsmXLZroZLxpPP/00L3vZy2a6GWZ1uX82zyOPPPLPEXHdRPVmXegvW7aMhx9+eKab8aJRrVapVCoz3Qyzutw/m0fSN8rU8/COmVlCHPpmZglx6JuZJcShb2aWEIe+mVlCJgx9SbslfUfSY2Msl6RPShqUdFzS6wrL3ifpq/nrfc1suJmZNa7Mkf69wJpxlq8FVuSvTuAPASS9AvgIcDOwGviIpGun0lgzM5uaCUM/Io4AT41TZR2wJzJfAq6RdD3wa8DhiHgqIr4PHGb8/zzMzOwSa8aPsxYDZwrzQ3nZWOUXkdRJ9lcCLS0tVKvVJjQrLbfcckvD6/T391+ClpiVNzIy4u/7NGtG6KtOWYxTfnFhRA/QA9De3h7+hV7jxnrA/bKuAzy5/bZpbo1ZOf5F7vRrxtU7Q8CSwnwrMDxOuZmZzZBmhP5+4L35VTyvB85FxLeAg8Ctkq7NT+DempeZmdkMmXB4R9JfABVgkaQhsity5gJExB8BvcDbgEHgGeC/5MuekvR7wNF8U90RMd4JYTMzu8QmDP2I2DDB8gA+MMay3cDuyTXN6nntRw9x7tnnG1pnWdeBhupffdVcjn3k1obWMbPLw6y7tbKN79yzzzd0YnYyJ8oa/U/CzC4fvg2DmVlCHPpmZgnx8M5lZsENXdx4X1djK93X6D4AfG2/2YuRQ/8y84NT2z2mb2aT5uEdM7OEOPTNzBLi0DczS4hD38wsIQ59M7OEOPTNzBLiSzYvQw1fUvlg4/feMbMXJ4f+ZabRB6L4ISpmVuThHTOzhDj0zcwS4tA3M0tIqdCXtEbSgKRBSRfd7UvSUkl9ko5LqkpqLSy7R9Jj+evOZjbezMwaM2HoS5oD7ALWAm3ABkltNdV2AHsiYiXQDWzL170NeB2wCrgZ+JCklzev+WZm1ogyR/qrgcGIOB0RzwF7gXU1ddqAvny6v7C8DXgoIkYj4mngGLBm6s22WpLqvr5xz6+PuczM0lMm9BcDZwrzQ3lZ0TGgI59eDyyQtDAvXyvppyQtAm4BlkytyVZPRNR99ff3j7nMzNJT5jr9eoeEtYmxGdgpaSNwBDgLjEbEIUm/DHwB+C7wRWD0oh1InUAnQEtLC9VqtWz7bQIjIyP+PG3Wcv+cfmVCf4gLj85bgeFihYgYBu4AkDQf6IiIc/myrcDWfNmnga/W7iAieoAegPb29mj0oR82tsk8RMVsurh/Tr8ywztHgRWSlkuaB9wF7C9WkLRI0vltbQF25+Vz8mEeJK0EVgKHmtV4MzNrzIRH+hExKmkTcBCYA+yOiJOSuoGHI2I/UAG2SQqy4Z0P5KvPBf5/ftLwX4F3R8RFwztmZjY9St17JyJ6gd6asrsL0/uAfXXW+yHZFTxmZjYL+Be5ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJKRX6ktZI
GpA0KKmrzvKlkvokHZdUldRaWPb7kk5KOiXpk8qfnWhmZtNvwtCXNAfYBawle/ThBkm1j0DcAeyJiJVAN7AtX/dXgDeSPRD9NcAvA29pWuvNzKwhZY70VwODEXE6Ip4D9gLrauq0AX35dH9heQAvBeYBV5I9KP3bU220mZlNTpnQXwycKcwP5WVFx4COfHo9sEDSwoj4Itl/At/KXwcj4tTUmmxmZpN1RYk69cbgo2Z+M7BT0kbgCHAWGJX0c8ANwPkx/sOS3hwRRy7YgdQJdAK0tLRQrVZLvwEb38jIiD9Pm7XcP6dfmdAfApYU5luB4WKFiBgG7gCQNB/oiIhzeZh/KSJG8mWfBV5P9h9Dcf0eoAegvb09KpXKpN6MXaxareLP02Yr98/pV2Z45yiwQtJySfOAu4D9xQqSFkk6v60twO58+pvAWyRdIWku2UlcD++Ymc2QCUM/IkaBTcBBssC+PyJOSuqWdHterQIMSHoCaAG25uX7gK8BJ8jG/Y9FxN819y2YmVlZZYZ3iIheoLem7O7C9D6ygK9d7wXgv02xjWZm1iT+Ra6ZWUIc+mZmCXHom5klxKFvZpYQh76ZWUIc+mZmCXHom5klxKFvZpYQh76ZWUIc+mZmCXHom5klxKFvZpYQh76ZWUIc+mZmCXHom5klpNT99M3MJkuq95jtiUXUPorbmqHUkb6kNZIGJA1K6qqzfKmkPknHJVUlteblt0h6tPD6oaS3N/tNmNnsFRFjvpb+j8+MucwujQlDX9IcYBewFmgDNkhqq6m2A9gTESuBbmAbQET0R8SqiFgFvBV4BjjUxPabmVkDyhzprwYGI+J0RDwH7AXW1dRpA/ry6f46ywHeAXw2Ip6ZbGPNzGxqyoT+YuBMYX4oLys6BnTk0+uBBZIW1tS5C/iLyTTSzMyao8yJ3HpnYWoH3DYDOyVtBI4AZ4HRn2xAuh64EThYdwdSJ9AJ0NLSQrVaLdEsK2NkZMSfp81q7p/Tq0zoDwFLCvOtwHCxQkQMA3cASJoPdETEuUKVdwJ/GxHP19tBRPQAPQDt7e1RqVTKtt8mUK1W8edps9aDB9w/p1mZ4Z2jwApJyyXNIxum2V+sIGmRpPPb2gLsrtnGBjy0Y2Y24yYM/YgYBTaRDc2cAu6PiJOSuiXdnlerAAOSngBagK3n15e0jOwvhYea2nIzM2tYqR9nRUQv0FtTdndheh+wb4x1n+TiE79mZjYDfBsGM7OEOPTNzBLi0DczS4hD38wsIQ59M7OEOPTNzBLi0DczS4hD38wsIQ59M7OEOPTNzBLi0DczS4hD38wsIQ59M7OEOPTNzBLi0DczS4hD38wsIQ59M7OElAp9SWskDUgalNRVZ/lSSX2SjkuqSmotLPsZSYcknZL0eP74RDMzmwEThr6kOcAuYC3QBmyQ1FZTbQewJyJWAt3AtsKyPcDHIuIGYDXwnWY03MzMGlfmGbmrgcGIOA0gaS+wDni8UKcN+K18uh94IK/bBlwREYcBImKkSe02s1nmtR89xLlnn294vWVdB0rXvfqquRz7yK0N78P+XZnQXwycKcwPATfX1DkGdACfANYDCyQtBH4e+BdJfwMsBz4HdEXEC1NtuJnNLueefZ4nt9/W0DrVapVKpVK6fiP/QVh9ZUJfdcqiZn4zsFPSRuAIcBYYzbf/JuCXgG8CfwlsBP7fBTuQOoFOgJaWFqrVatn22wRGRkb8edq0abSvTaZ/uj9PTZnQHwKWFOZbgeFihYgYBu4AkDQf6IiIc5KGgH8sDA09ALyemtCPiB6gB6C9vT0a+Z/fxtfokZTZpD14oOG+1nD/nMQ+7EJlrt45CqyQtFzSPOAuYH+xgqRFks5vawuwu7DutZKuy+ffyoXnAszMbBpNGPoRMQpsAg4Cp4D7I+KkpG5Jt+fVKsCApCeAFmBrvu4LZEM/fZJOkA0Vfarp78LMzEopM7xDRPQCvTVldxem9wH7xlj3MLByCm00M7Mm8S9y
zcwS4tA3M0uIQ9/MLCEOfTOzhDj0zcwS4tA3M0uIQ9/MLCEOfTOzhDj0zcwS4tA3M0uIQ9/MLCEOfTOzhDj0zcwS4tA3M0uIQ9/MLCEOfTOzhDj0zcwSUir0Ja2RNCBpUFJXneVLJfVJOi6pKqm1sOwFSY/mr/2165qZ2fSZ8HGJkuYAu4BfBYaAo5L2R0TxAec7gD0RcZ+ktwLbgPfky56NiFVNbreZmU1CmSP91cBgRJyOiOeAvcC6mjptQF8+3V9nuZmZzQJlHoy+GDhTmB8Cbq6pcwzoAD4BrAcWSFoYEd8DXirpYWAU2B4RD9TuQFIn0AnQ0tJCtVpt9H3YGEZGRvx52rRptK9Npn+6P09NmdBXnbKomd8M7JS0ETgCnCULeYCfiYhhSf8B+LykExHxtQs2FtED9AC0t7dHpVIp/w5sXNVqFX+eNi0ePNBwX2u4f05iH3ahMqE/BCwpzLcCw8UKETEM3AEgaT7QERHnCsuIiNOSqsAvAReEvpld/hbc0MWN9110ncfE7mtkHwC3Nb4P+4kyoX8UWCFpOdkR/F3Au4oVJC0CnoqIHwNbgN15+bXAMxHxo7zOG4Hfb2L7zWyW+MGp7Ty5vbFAbvRIf1nXgQZbZbUmPJEbEaPAJuAgcAq4PyJOSuqWdHterQIMSHoCaAG25uU3AA9LOkZ2gnd7zVU/ZmY2jcoc6RMRvUBvTdndhel9wL46630BuHGKbTQzsybxL3LNzBLi0DczS4hD38wsIQ59M7OEOPTNzBLi0DczS4hD38wsIQ59M7OEOPTNzBLi0DczS4hD38wsIQ59M7OEOPTNzBLi0DczS4hD38wsIQ59M7OElAp9SWskDUgalHTRQzAlLZXUJ+m4pKqk1prlL5d0VtLOZjXczMwaN2HoS5oD7ALWAm3ABkltNdV2AHsiYiXQDWyrWf57wENTb66ZmU1FmSP91cBgRJyOiOeAvcC6mjptQF8+3V9cLukmsufmHpp6c83MbCrKhP5i4ExhfigvKzoGdOTT64EFkhZKegnwB8CHptpQMzObujIPRledsqiZ3wzslLQROAKcBUaB9wO9EXFGqreZfAdSJ9AJ0NLSQrVaLdEsK2NkZMSfp02bRvvaZPqn+/PUlAn9IWBJYb4VGC5WiIhh4A4ASfOBjog4J+kNwJskvR+YD8yTNBIRXTXr9wA9AO3t7VGpVCb5dqxWtVrFn6dNiwcPNNzXGu6fk9iHXahM6B8FVkhaTnYEfxfwrmIFSYuApyLix8AWYDdARPznQp2NQHtt4JuZ2fSZcEw/IkaBTcBB4BRwf0SclNQt6fa8WgUYkPQE2UnbrZeovWZmNgVljvSJiF6gt6bs7sL0PmDfBNu4F7i34RaamVnT+Be5ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJKXX1jplZGcu6DjS+0oPl17n6qrmNb98u4NA3s6Z4cvttDa+zrOvApNazyfPwjplZQhz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJKRX6ktZIGpA0KOmiZ9xKWiqpT9JxSVVJrYXyRyQ9KumkpP/e7DdgZmblTRj6kuYAu4C1QBuwQVJbTbUdwJ6IWAl0A9vy8m8BvxIRq4CbgS5Jr2pW483MrDFljvRXA4MRcToingP2Autq6rQBffl0//nlEfFcRPwoL7+y5P7MzOwSKRPCi4EzhfmhvKzoGNCRT68HFkhaCCBpiaTj+TbuiYjhqTXZzMwmq8ytlVWnLGrmNwM7JW0EjgBngVGAiDgDrMyHdR6QtC8ivn3BDqROoBOgpaWFarXayHuwcYyMjPjztFnN/XN6lQn9IWBJYb4VuOBoPT96vwNA0nygIyLO1daRdBJ4E7CvZlkP0APQ3t4elUqlsXdhY6pWq/jztFnrwQPun9OszPDOUWCFpOWS5gF3AfuLFSQtknR+W1uA3Xl5q6Sr8ulrgTcCA81qvJmZNWbC0I+I
UWATcBA4BdwfEScldUu6Pa9WAQYkPQG0AFvz8huAL0s6BjwE7IiIE01+D2ZmVlKpxyVGRC/QW1N2d2F6HzVDNnn5YWDlFNtoZmZN4ksozcwS4tA3M0uIQ9/MLCEOfTOzhDj0zcwS4tA3M0uIQ9/MLCEOfTOzhDj0zcwS4tA3M0uIQ9/MLCEOfTOzhDj0zcwSUuoum2ZmkyXVe/heYfk99csjah/QZ83gI30zu6QiYsxXf3//mMvs0nDom5klpFToS1ojaUDSoKSuOsuXSuqTdFxSVVJrXr5K0hclncyX3dnsN2BmZuVNGPqS5gC7gLVAG7BBUltNtR3AnohYCXQD2/LyZ4D3RsSrgTXAxyVd06zGm5lZY8oc6a8GBiPidEQ8B+wF1tXUaQP68un+88sj4omI+Go+PQx8B7iuGQ03M7PGlQn9xcCZwvxQXlZ0DOjIp9cDCyQtLFaQtBqYB3xtck01M7OpKnPJZr3rrWpPrW8GdkraCBwBzgKjP9mAdD3wp8D7IuLHF+1A6gQ6AVpaWqhWq2XabiWMjIz487RZy/1z+pUJ/SFgSWG+FRguVsiHbu4AkDQf6IiIc/n8y4EDwG9HxJfq7SAieoAegPb29qhUKo29CxtTtVrFn6fNVu6f06/M8M5RYIWk5ZLmAXcB+4sVJC2SdH5bW4Ddefk84G/JTvL+VfOabWZmk6EyP4KQ9Dbg48AcYHdEbJXUDTwcEfslvYPsip0gG975QET8SNK7gT8BThY2tzEiHh1nX98FvjHpd2S1FgH/PNONMBuD+2fzLI2ICS+UKRX6dvmS9HBEtM90O8zqcf+cfv5FrplZQhz6ZmYJcei/+PXMdAPMxuH+Oc08pm9mlhAf6ZuZJcShb2YXkXSNpPdPct3flPRTzW6TNYdDfwZM9gslqdd3KbVpcg0wqdAHfhOYttDP7wRsJTn0Z0bdL9REnTci3hYR/3LJWlWSv2RJ2A78rKRHJX1M0ockHc2fi/FRAEkvk3RA0jFJj0m6U9IHgVcB/ZL6621Y0hxJ9+brnJD0W3n5z0n6XL69f5D0s8p8rFD3zrxuRVK/pE8DJ/Kyd0v6St7mP3Y/HcN4jzLz69K8yG5P/SzwKNltLvqBTwOP58sfAB4h+yVzZ2G9J8l+wbgMOAV8Kq9zCLhqnP19EHgcOA7szcvmk/1a+kRe3pGXb8jLHgPuKWxjhOxZCV8G/iNwE/BQ3s6DwPUz/bn61dQ+ugx4LJ++lewqG5EdKH4GeDPZnXU/VVjn6vzfJ4FF42z7JuBwYf6a/N8vA+vz6ZeS/bXQARwmuxtAC/BN4HqgAjwNLM/r3wD8HTA3n/+/ZM/ymPHPcra9ZrwBKb5qvlAXdN687BX5v1fl4bswny+G/iiwKi+/H3j3OPsbBq7Mp89/we4BPl6ocy3ZEdo3yZ55cAXweeDt+fIA3plPzwW+AFyXz99JdnuOGf9s/bokfXRH3vcezV+DwH8Ffh74et6X3lRYd6LQv5bsFuv/h+zhSi8BFgBDder+b+A3CvN/Ctyef2/6C+Wb8n5+vo0DwO/O9Oc4G19l7rJpl95XIuLrhfkPSlqfTy8BVgDfq1nn6/Hv9zB6hOxLOpbjwJ9LeoDsrwiA/0R28zwAIuL7kt4MVCPiuwCS/pzsiO4B4AXgr/PqvwC8BjgsCbKjsG+Ve6t2GRKwLSL++KIF0k3A24Btkg5FRPdEG8v72muBXwM+ALyT7DzAWPsey9M19e6LiC0T7T91HtOfHX7SeSVVyAL5DRHxWuAfyf7UrfWjwvQLjH+b7NvIHnl5E/CIpCvIviS1P9IY7wv2w4h4oVDvZESsyl83RsSt46xrl58fkB19QzZ89xv5bdORtFjSKyW9CngmIv6M7K+B19VZ9yKSFgEviYi/Bn4HeF1E/CswJOnteZ0r8yuAjgB35ucBriM7CPlKnc32Ae+Q9Mp8/VdIWjqVD+DFyqE/M8b7UlwNfD8inpH0i8Drp7Kj/JbXSyKiH/gw
2Unk+WTnATYV6l1LNqb6lvxW2XPIxvcfqrPZAeA6SW/I150r6dVTaafNLhHxPeDvJT0G/CrZOacvSjoB7CPrvzcCX5H0KPA/gf+Vr94DfHasE7lkT96r5uvdS3Y7doD3kP2Ve5xs+PCnyW7Nfpzs6XyfBz4cEf9Up72PA78NHMrXP0w29m81/IvcGZJfdbCS7ITutyPi1/PyK8mGUxaThyvZ2GRV0pNAO1lofyYiXpOvsxmYHxG/W2c/c8lOFF9NdoT+ZxGxPT9qO3/0/wLw0Yj4G0nvIvsSCuiNiA/n2xmJiPmF7a4CPplv9wqy8wOfauJHZGaXgEPfzCwhPpFrZpeMpC8DV9YUvyciTsxEe8xH+i8qknYBb6wp/kRE/MlMtMfMZh+HvplZQnz1jplZQhz6ZmYJceibmSXEoW9mlhCHvplZQv4N/UFnrL4Bt0AAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x25635336358>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# %load solutions/03_solutions.py\n",
    "pipe = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000))\n",
    "scores = cross_validate(pipe, X_breast, y_breast, scoring='balanced_accuracy', cv=3, return_train_score=True)\n",
    "df_scores = pd.DataFrame(scores)\n",
    "df_scores[['train_score', 'test_score']].boxplot()\n"
   ]
  },
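  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder, `cross_validate` returns a dictionary with one array entry per fold for each recorded quantity (`fit_time`, `score_time`, `test_score`, and `train_score` when `return_train_score=True`), which is why it converts directly into a `DataFrame`. A minimal sketch, reusing the `scores` dictionary computed above:\n",
    "\n",
    "```python\n",
    "# One array per fold for each recorded quantity\n",
    "sorted(scores.keys())\n",
    "```\n",
    "\n",
    "提醒一下，`cross_validate`返回一个字典，其中每个记录量（`fit_time`、`score_time`、`test_score`，以及当`return_train_score=True`时的`train_score`）对应一个按折叠排列的数组，因此可以直接转换为`DataFrame`。"
   ]
  },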
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Hyper-parameters optimization: fine-tune the inside of a pipeline\n",
    "\n",
    "## 4.超参数优化：微调管道内部"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes you would like to find the values of a pipeline component's parameters that lead to the best accuracy. We already saw that we can list the parameters of a pipeline using `get_params()`.\n",
    "\n",
    "有时您希望找到管道组件的参数，从而获得最佳精度。 我们已经看到我们可以使用`get_params()`检查管道的参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'memory': None,\n",
       " 'sgdclassifier': SGDClassifier(alpha=0.0001, average=False, class_weight=None,\n",
       "        early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,\n",
       "        l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000,\n",
       "        n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',\n",
       "        power_t=0.5, random_state=None, shuffle=True, tol=None,\n",
       "        validation_fraction=0.1, verbose=0, warm_start=False),\n",
       " 'sgdclassifier__alpha': 0.0001,\n",
       " 'sgdclassifier__average': False,\n",
       " 'sgdclassifier__class_weight': None,\n",
       " 'sgdclassifier__early_stopping': False,\n",
       " 'sgdclassifier__epsilon': 0.1,\n",
       " 'sgdclassifier__eta0': 0.0,\n",
       " 'sgdclassifier__fit_intercept': True,\n",
       " 'sgdclassifier__l1_ratio': 0.15,\n",
       " 'sgdclassifier__learning_rate': 'optimal',\n",
       " 'sgdclassifier__loss': 'hinge',\n",
       " 'sgdclassifier__max_iter': 1000,\n",
       " 'sgdclassifier__n_iter': None,\n",
       " 'sgdclassifier__n_iter_no_change': 5,\n",
       " 'sgdclassifier__n_jobs': None,\n",
       " 'sgdclassifier__penalty': 'l2',\n",
       " 'sgdclassifier__power_t': 0.5,\n",
       " 'sgdclassifier__random_state': None,\n",
       " 'sgdclassifier__shuffle': True,\n",
       " 'sgdclassifier__tol': None,\n",
       " 'sgdclassifier__validation_fraction': 0.1,\n",
       " 'sgdclassifier__verbose': 0,\n",
       " 'sgdclassifier__warm_start': False,\n",
       " 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True),\n",
       " 'standardscaler__copy': True,\n",
       " 'standardscaler__with_mean': True,\n",
       " 'standardscaler__with_std': True,\n",
       " 'steps': [('standardscaler',\n",
       "   StandardScaler(copy=True, with_mean=True, with_std=True)),\n",
       "  ('sgdclassifier',\n",
       "   SGDClassifier(alpha=0.0001, average=False, class_weight=None,\n",
       "          early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,\n",
       "          l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000,\n",
       "          n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',\n",
       "          power_t=0.5, random_state=None, shuffle=True, tol=None,\n",
       "          validation_fraction=0.1, verbose=0, warm_start=False))]}"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe.get_params()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hyper-parameters can be optimized by an exhaustive search. [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) provides such a utility and performs a cross-validated grid search over a parameter grid.\n",
    "\n",
    "Let's give an example in which we would like to optimize the `C` and `penalty` parameters of the `LogisticRegression` classifier.\n",
    "\n",
    "可以通过穷举搜索来优化超参数。`GridSearchCV`提供此类实用程序，并通过参数网格进行交叉验证的网格搜索。\n",
    "\n",
    "如下例子，我们希望优化`LogisticRegression`分类器的`C`和`penalty`参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=3, error_score='raise-deprecating',\n",
       "       estimator=Pipeline(memory=None,\n",
       "     steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "          intercept_scaling=1, max_iter=5000, multi_class='auto',\n",
       "          n_jobs=None, penalty='l2', random_state=42, solver='saga',\n",
       "          tol=0.0001, verbose=0, warm_start=False))]),\n",
       "       fit_params=None, iid='warn', n_jobs=-1,\n",
       "       param_grid={'logisticregression__C': [0.1, 1.0, 10], 'logisticregression__penalty': ['l2', 'l1']},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,\n",
       "       scoring=None, verbose=0)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "pipe = make_pipeline(MinMaxScaler(),\n",
    "                     LogisticRegression(solver='saga', multi_class='auto',\n",
    "                                        random_state=42, max_iter=5000))\n",
    "param_grid = {'logisticregression__C': [0.1, 1.0, 10],\n",
    "              'logisticregression__penalty': ['l2', 'l1']}\n",
    "grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, return_train_score=True)\n",
    "grid.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When fitting the grid-search object, it finds the best possible parameter combination on the training set (using cross-validation). We can introspect the results of the grid-search by accessing the attribute `cv_results_`. It allows us to check the effect of the parameters on the model performance.\n",
    "\n",
    "在拟合网格搜索对象时，它会在训练集上找到最佳的参数组合（使用交叉验证）。我们可以通过访问属性`cv_results_`来查看网格搜索的结果，从而检查参数对模型性能的影响。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_fit_time</th>\n",
       "      <th>mean_score_time</th>\n",
       "      <th>mean_test_score</th>\n",
       "      <th>mean_train_score</th>\n",
       "      <th>param_logisticregression__C</th>\n",
       "      <th>param_logisticregression__penalty</th>\n",
       "      <th>params</th>\n",
       "      <th>rank_test_score</th>\n",
       "      <th>split0_test_score</th>\n",
       "      <th>split0_train_score</th>\n",
       "      <th>split1_test_score</th>\n",
       "      <th>split1_train_score</th>\n",
       "      <th>split2_test_score</th>\n",
       "      <th>split2_train_score</th>\n",
       "      <th>std_fit_time</th>\n",
       "      <th>std_score_time</th>\n",
       "      <th>std_test_score</th>\n",
       "      <th>std_train_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.480382</td>\n",
       "      <td>0.040559</td>\n",
       "      <td>0.942836</td>\n",
       "      <td>0.955084</td>\n",
       "      <td>0.1</td>\n",
       "      <td>l2</td>\n",
       "      <td>{'logisticregression__C': 0.1, 'logisticregres...</td>\n",
       "      <td>5</td>\n",
       "      <td>0.951542</td>\n",
       "      <td>0.952968</td>\n",
       "      <td>0.935268</td>\n",
       "      <td>0.959956</td>\n",
       "      <td>0.941573</td>\n",
       "      <td>0.952328</td>\n",
       "      <td>0.056607</td>\n",
       "      <td>1.155489e-02</td>\n",
       "      <td>0.006717</td>\n",
       "      <td>0.003455</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.089751</td>\n",
       "      <td>0.026928</td>\n",
       "      <td>0.892353</td>\n",
       "      <td>0.903496</td>\n",
       "      <td>0.1</td>\n",
       "      <td>l1</td>\n",
       "      <td>{'logisticregression__C': 0.1, 'logisticregres...</td>\n",
       "      <td>6</td>\n",
       "      <td>0.885463</td>\n",
       "      <td>0.905935</td>\n",
       "      <td>0.908482</td>\n",
       "      <td>0.902113</td>\n",
       "      <td>0.883146</td>\n",
       "      <td>0.902439</td>\n",
       "      <td>0.279431</td>\n",
       "      <td>8.141014e-04</td>\n",
       "      <td>0.011425</td>\n",
       "      <td>0.001730</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.362025</td>\n",
       "      <td>0.019614</td>\n",
       "      <td>0.963623</td>\n",
       "      <td>0.986634</td>\n",
       "      <td>1</td>\n",
       "      <td>l2</td>\n",
       "      <td>{'logisticregression__C': 1.0, 'logisticregres...</td>\n",
       "      <td>2</td>\n",
       "      <td>0.977974</td>\n",
       "      <td>0.985442</td>\n",
       "      <td>0.955357</td>\n",
       "      <td>0.987764</td>\n",
       "      <td>0.957303</td>\n",
       "      <td>0.986696</td>\n",
       "      <td>0.044381</td>\n",
       "      <td>1.316415e-02</td>\n",
       "      <td>0.010263</td>\n",
       "      <td>0.000949</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.539531</td>\n",
       "      <td>0.000665</td>\n",
       "      <td>0.953229</td>\n",
       "      <td>0.978837</td>\n",
       "      <td>1</td>\n",
       "      <td>l1</td>\n",
       "      <td>{'logisticregression__C': 1.0, 'logisticregres...</td>\n",
       "      <td>4</td>\n",
       "      <td>0.964758</td>\n",
       "      <td>0.977604</td>\n",
       "      <td>0.950893</td>\n",
       "      <td>0.977753</td>\n",
       "      <td>0.943820</td>\n",
       "      <td>0.981153</td>\n",
       "      <td>1.073243</td>\n",
       "      <td>4.699094e-04</td>\n",
       "      <td>0.008710</td>\n",
       "      <td>0.001639</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3.466066</td>\n",
       "      <td>0.000998</td>\n",
       "      <td>0.968077</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>10</td>\n",
       "      <td>l2</td>\n",
       "      <td>{'logisticregression__C': 10, 'logisticregress...</td>\n",
       "      <td>1</td>\n",
       "      <td>0.977974</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.962054</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.964045</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.227521</td>\n",
       "      <td>5.947204e-07</td>\n",
       "      <td>0.007103</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>9.693750</td>\n",
       "      <td>0.000665</td>\n",
       "      <td>0.960653</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>10</td>\n",
       "      <td>l1</td>\n",
       "      <td>{'logisticregression__C': 10, 'logisticregress...</td>\n",
       "      <td>3</td>\n",
       "      <td>0.973568</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.957589</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.950562</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.904692</td>\n",
       "      <td>4.704713e-04</td>\n",
       "      <td>0.009643</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   mean_fit_time  mean_score_time  mean_test_score  mean_train_score  \\\n",
       "0       0.480382         0.040559         0.942836          0.955084   \n",
       "1       1.089751         0.026928         0.892353          0.903496   \n",
       "2       1.362025         0.019614         0.963623          0.986634   \n",
       "3       4.539531         0.000665         0.953229          0.978837   \n",
       "4       3.466066         0.000998         0.968077          1.000000   \n",
       "5       9.693750         0.000665         0.960653          1.000000   \n",
       "\n",
       "  param_logisticregression__C param_logisticregression__penalty  \\\n",
       "0                         0.1                                l2   \n",
       "1                         0.1                                l1   \n",
       "2                           1                                l2   \n",
       "3                           1                                l1   \n",
       "4                          10                                l2   \n",
       "5                          10                                l1   \n",
       "\n",
       "                                              params  rank_test_score  \\\n",
       "0  {'logisticregression__C': 0.1, 'logisticregres...                5   \n",
       "1  {'logisticregression__C': 0.1, 'logisticregres...                6   \n",
       "2  {'logisticregression__C': 1.0, 'logisticregres...                2   \n",
       "3  {'logisticregression__C': 1.0, 'logisticregres...                4   \n",
       "4  {'logisticregression__C': 10, 'logisticregress...                1   \n",
       "5  {'logisticregression__C': 10, 'logisticregress...                3   \n",
       "\n",
       "   split0_test_score  split0_train_score  split1_test_score  \\\n",
       "0           0.951542            0.952968           0.935268   \n",
       "1           0.885463            0.905935           0.908482   \n",
       "2           0.977974            0.985442           0.955357   \n",
       "3           0.964758            0.977604           0.950893   \n",
       "4           0.977974            1.000000           0.962054   \n",
       "5           0.973568            1.000000           0.957589   \n",
       "\n",
       "   split1_train_score  split2_test_score  split2_train_score  std_fit_time  \\\n",
       "0            0.959956           0.941573            0.952328      0.056607   \n",
       "1            0.902113           0.883146            0.902439      0.279431   \n",
       "2            0.987764           0.957303            0.986696      0.044381   \n",
       "3            0.977753           0.943820            0.981153      1.073243   \n",
       "4            1.000000           0.964045            1.000000      0.227521   \n",
       "5            1.000000           0.950562            1.000000      0.904692   \n",
       "\n",
       "   std_score_time  std_test_score  std_train_score  \n",
       "0    1.155489e-02        0.006717         0.003455  \n",
       "1    8.141014e-04        0.011425         0.001730  \n",
       "2    1.316415e-02        0.010263         0.000949  \n",
       "3    4.699094e-04        0.008710         0.001639  \n",
       "4    5.947204e-07        0.007103         0.000000  \n",
       "5    4.704713e-04        0.009643         0.000000  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_grid = pd.DataFrame(grid.cv_results_)\n",
    "df_grid"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, the grid-search object also behaves as an estimator. Once it has been fitted, calling `score` will use the best hyper-parameters found during the search.\n",
    "\n",
    "默认情况下，网格搜索对象也表现为估计器。一旦它被`fit`后，调用`score`时将使用搜索过程中找到的最佳超参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'logisticregression__C': 10, 'logisticregression__penalty': 'l2'}"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid.best_params_"
   ]
  },
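  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides `best_params_`, a fitted grid-search also exposes `best_score_` (the mean cross-validated score of the best setting) and `best_estimator_` (the pipeline refitted with those parameters). A minimal sketch, using the iris dataset as a stand-in rather than the data above:\n",
    "\n",
    "```python\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "X_demo, y_demo = load_iris(return_X_y=True)\n",
    "pipe = make_pipeline(MinMaxScaler(),\n",
    "                     LogisticRegression(solver='saga', max_iter=5000))\n",
    "demo_grid = GridSearchCV(pipe, param_grid={'logisticregression__C': [0.1, 1.0]}, cv=3)\n",
    "demo_grid.fit(X_demo, y_demo)\n",
    "print(demo_grid.best_score_)      # mean cross-validated score of the best setting\n",
    "print(demo_grid.best_estimator_)  # the pipeline refitted with the best parameters\n",
    "```\n",
    "\n",
    "除了`best_params_`之外，拟合后的网格搜索还提供`best_score_`（最佳参数的平均交叉验证得分）和`best_estimator_`（用最佳参数重新拟合的流水线）。上面是一个最小示例，使用iris数据集作为替代数据。"
   ]
  },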
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides, it is possible to use the grid-search object like any other classifier to make predictions.\n",
    "\n",
    "此外，可以像使用任何其他分类器一样使用网格搜索对象进行预测。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of the GridSearchCV is 0.96\n"
     ]
    }
   ],
   "source": [
    "accuracy = grid.score(X_test, y_test)\n",
    "print('Accuracy score of the {} is {:.2f}'.format(grid.__class__.__name__, accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Up to now, we only fitted the grid-search on a single split. However, as previously stated, we might be interested in an outer cross-validation to estimate the performance of the model on different samples of the data and check the potential variation in performance. Since the grid-search is an estimator, we can use it directly within the `cross_validate` function.\n",
    "\n",
    "到目前为止，我们只在单个数据分割上拟合了网格搜索。但是，如前所述，我们可能有兴趣进行外部交叉验证，以在不同的数据样本上估计模型的性能，并检查性能的潜在变化。由于网格搜索是一个估计器，我们可以直接在`cross_validate`函数中使用它。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fit_time</th>\n",
       "      <th>score_time</th>\n",
       "      <th>test_score</th>\n",
       "      <th>train_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>32.169982</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.928571</td>\n",
       "      <td>0.985774</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>33.664034</td>\n",
       "      <td>0.000999</td>\n",
       "      <td>0.946578</td>\n",
       "      <td>0.997496</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>31.734155</td>\n",
       "      <td>0.000998</td>\n",
       "      <td>0.924497</td>\n",
       "      <td>0.993339</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    fit_time  score_time  test_score  train_score\n",
       "0  32.169982    0.000000    0.928571     0.985774\n",
       "1  33.664034    0.000999    0.946578     0.997496\n",
       "2  31.734155    0.000998    0.924497     0.993339"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scores = cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True)\n",
    "df_scores = pd.DataFrame(scores)\n",
    "df_scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Reuse the previous pipeline for the breast cancer dataset and run a grid-search to evaluate the difference between the `hinge` and `log` losses. Besides, fine-tune the `penalty`.\n",
    "\n",
    "#### 练习\n",
    "重复使用乳腺癌数据集的先前管道并进行网格搜索，以评估`hinge`（铰链）和`log`（对数）损失之间的差异。此外，微调`penalty`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\stochastic_gradient.py:183: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.\n",
      "  FutureWarning)\n",
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\stochastic_gradient.py:183: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.\n",
      "  FutureWarning)\n",
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\stochastic_gradient.py:183: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.\n",
      "  FutureWarning)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'sgdclassifier__loss': 'log', 'sgdclassifier__penalty': 'l2'}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\stochastic_gradient.py:183: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.\n",
      "  FutureWarning)\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD9CAYAAABQvqc9AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFapJREFUeJzt3X+wXOV93/H3x+KHiSGYIHJrIyLRmLSoRsb2DY7j2r721C7EHTAoY0Nqd5h0Ru04jCedwY4YtzhWy4Br0tYpdCbylBiapJRREg+xZH5E0ZpO7dhAg4QFIypjYoRc13ETxRfjYNFv/9ijZFmudHfvXelKet6vmR2d85znOefZo7OfPfc5u3tSVUiS2vCype6AJOnIMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTlhqTswbPny5bVq1aql7sZx49lnn+UVr3jFUndDmpPH5+Q8/PDDf1ZVZ81X76gL/VWrVvHQQw8tdTeOG71ej5mZmaXuhjQnj8/JSfKno9RzeEeSGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUkKPuy1lamCRjt/H+yFJ7PNM/TlTVnI+Vv/L5gy6T1B5DX5IaYuhLUkMMfUlqiKEvSQ3x0zvHmNd94j72PffDsdqsWr95rPqnn3Ii2z/+7rHaSDo2GPrHmH3P/ZCnbnrPyPUX8nvl475JSDp2OLwjSQ0x9CWpIYa+JDXEMf1jzGnnr+eC29eP1+j2cbcBMPp1A0nHDkP/GPO9x2/yQq6kBXN4R5IaYuhLUkMMfUlqyEihn+TiJLuS7E7ykquISVYm2ZpkR5JekhUDyz6Z5Gvd4/2T7LwkaTzzhn6SZcCtwCXAauCqJKuHqt0M3FFVa4ANwI1d2/cAbwAuBN4EfCTJj06u+5KkcYxypn8RsLuqnqyq54E7gcuG6qwGtnbT2waWrwa+WFX7q+pZYDtw8eK7LUlaiFE+snk28PTA/B76Z+2DtgNrgU8DlwOnJTmzK/94kn8H/AjwDuCx4Q0kWQesA5iamqLX6433LBozzv6ZnZ1d0P70/0BHwkKPTy3cKKE/181Xh++1dy1wS5KrgQeAZ4D9VXVfkp8GvgR8B/gysP8lK6vaCGwEmJ6ernE/V96UezaP9bn7hXxOf9xtSAu1oONTizLK8M4e4JyB+RXA3sEKVbW3qq6oqtcDH+vK9nX/3lBVF1bVu+i/gfyvifRckjS2UUL/QeC8JOcmOQm4Erh7sEKS5UkOrOs64LaufFk3zEOSNcAa4L5JdV6SNJ55h3eqan+Sa4B7gWXAbVW1M8kG4KGquhuYAW5MUvSHd36pa34i8N+TAPwl8IGqesnwjiTpyBjpt3eqaguwZajs+oHpTcCmOdr9gP4neDRBY/82zj3j3zlL0vHJH1w7xozzY2vQf4MYt42k45c/wyBJDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0JekhviDa8eJ7uer5172ybnLq4ZvgCbpeOeZ/nGiquZ8bNu27aDLJLXH0Jekhhj6ktQQQ1+SGmLoS1JDDH1JashIoZ/k4iS7kuxOsn6O5SuTbE2yI0kvyYqBZf82yc4kjyf59Rzqs4WSpMNq3tBPsgy4FbgEWA1clWT1ULWbgTuqag2wAbixa/uzwFuANcBrgZ8G3j6x3kuSxjLKmf5FwO6qerKqngfuBC4bqrMa2NpNbxtYXsDLgZOAk4ETgW8vttOSpIUZJfTPBp4emN/TlQ3aDqztpi8HTktyZlV9mf6bwLe6x71V9fjiuixJWqhRfoZhrjH44a9zXgvckuRq4AHgGWB/ktcA5wMHxvjvT/K2qnrgRRtI1gHrAKampuj1eiM/AR3a7Oys+1NHLY/PI2+U0N8DnDMwvwLYO1ihqvYCVwAkORVYW1X7ujD/
46qa7ZZ9AfgZ+m8Mg+03AhsBpqena2ZmZkFPRi/V6/Vwf+po5fF55I0yvPMgcF6Sc5OcBFwJ3D1YIcnyJAfWdR1wWzf9TeDtSU5IciL9i7gO70jSEpk39KtqP3ANcC/9wL6rqnYm2ZDk0q7aDLAryRPAFHBDV74J+DrwKP1x/+1V9QeTfQqSpFGN9NPKVbUF2DJUdv3A9Cb6AT/c7gXgny2yj5KkCfEbuZLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDRrpHriQtVJIFtauqCfdE4Jm+pMOsqg76WPkrnz/oMh0eI4V+kouT7EqyO8n6OZavTLI1yY4kvSQruvJ3JHlk4PGDJO+d9JOQJI1m3tBPsgy4FbgEWA1clWT1ULWbgTuqag2wAbgRoKq2VdWFVXUh8E7g+8B9E+y/JGkMo5zpXwTsrqonq+p54E7gsqE6q4Gt3fS2OZYD/Dzwhar6/kI7K0lanFFC/2zg6YH5PV3ZoO3A2m76cuC0JGcO1bkS+K8L6aQkaTJG+fTOXJfeh6+yXAvckuRq4AHgGWD/X68geRVwAXDvnBtI1gHrAKampuj1eiN0S6OYnZ11f+qo5vF5ZI0S+nuAcwbmVwB7BytU1V7gCoAkpwJrq2rfQJX3Ab9fVT+cawNVtRHYCDA9PV0zMzOj9l/z6PV6uD911Lpns8fnETbK8M6DwHlJzk1yEv1hmrsHKyRZnuTAuq4Dbhtax1U4tCNJS27e0K+q/cA19IdmHgfuqqqdSTYkubSrNgPsSvIEMAXccKB9klX0/1L44kR7Lkka20jfyK2qLcCWobLrB6Y3AZsO0vYpXnrhV5K0BPxGriQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGeGN0SRPxuk/cx77n5vwh3UNatX7zyHVPP+VEtn/83WNvQ3/D0Jc0Efue+yFP3fSesdqM+9Pf47xBaG4O70hSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkNGCv0kFyfZlWR3kvVzLF+ZZGuSHUl6SVYMLPuJJPcleTzJY0lWTa77kqRxzBv6SZYBtwKXAKuBq5KsHqp2M3BHVa0BNgA3Diy7A/hUVZ0PXAT8n0l0XJI0vlHO9C8CdlfVk1X1PHAncNlQndXA1m5624Hl3ZvDCVV1P0BVzVbV9yfSc0nS2Eb5wbWzgacH5vcAbxqqsx1YC3wauBw4LcmZwE8Bf5Hk94BzgT8E1lfVC4ONk6wD1gFMTU3R6/XGfyaa0+zsrPtTR8y4x9pCjk+P58UZJfQzR1kNzV8L3JLkauAB4Blgf7f+twKvB74J/DfgauA/v2hlVRuBjQDT09M1zq/u6dDG/RVDacHu2Tz2sTb28bmAbejFRhne2QOcMzC/Atg7WKGq9lbVFVX1euBjXdm+ru2fdEND+4HPAW+YSM8lSWMbJfQfBM5Lcm6Sk4ArgbsHKyRZnuTAuq4Dbhtoe0aSs7r5dwKPLb7bkqSFmDf0uzP0a4B7gceBu6pqZ5INSS7tqs0Au5I8AUwBN3RtX6A/9LM1yaP0h4o+M/FnIUkayUh3zqqqLcCWobLrB6Y3AZsO0vZ+YM0i+ihJmhC/kStJDfEeuZIm4rTz13PB7S/5wv78bh9nGwDj3YdXL2boS5qI7z1+kzdGPwY4vCNJDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNGSn0k1ycZFeS3Ulecj+0JCuTbE2yI0kvyYqBZS8keaR73D3JzkuSxjPv7RKTLANuBd4F7AEeTHJ3VT02UO1m4I6quj3JO4EbgQ92y56rqgsn3G9J0gKMcqZ/EbC7
qp6squeBO4HLhuqsBrZ209vmWC5JOgqMEvpnA08PzO/pygZtB9Z205cDpyU5s5t/eZKHkvxxkvcuqreSpEWZd3gHyBxlNTR/LXBLkquBB4BngP3dsp+oqr1J/jbwR0keraqvv2gDyTpgHcDU1BS9Xm/0Z6BDmp2ddX/qiBn3WFvI8enxvDijhP4e4JyB+RXA3sEKVbUXuAIgyanA2qraN7CMqnoySQ94PfD1ofYbgY0A09PTNTMzs4Cnorn0ej3cnzoi7tk89rE29vG5gG3oxUYZ3nkQOC/JuUlOAq4EXvQpnCTLkxxY13XAbV35GUlOPlAHeAsweAFYknQEzRv6VbUfuAa4F3gcuKuqdibZkOTSrtoMsCvJE8AUcENXfj7wUJLt9C/w3jT0qR9J0hE0yvAOVbUF2DJUdv3A9CZg0xztvgRcsMg+SpImxG/kSlJDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGjPTTypI0ilXrN4/f6J7R25x+yonjr18vYuhLmoinbnrP2G1Wrd+8oHZaOId3JKkhhr4kNcTQl6SGGPqS1BBDX5IaMlLoJ7k4ya4ku5Osn2P5yiRbk+xI0kuyYmj5jyZ5Jsktk+q4JGl884Z+kmXArcAlwGrgqiSrh6rdDNxRVWuADcCNQ8v/NfDFxXdXkrQYo5zpXwTsrqonq+p54E7gsqE6q4Gt3fS2weVJ3ghMAfctvruSpMUYJfTPBp4emN/TlQ3aDqztpi8HTktyZpKXAb8GfGSxHZUkLd4o38jNHGU1NH8tcEuSq4EHgGeA/cCHgC1V9XQy12q6DSTrgHUAU1NT9Hq9EbqlUczOzro/dVTz+DyyRgn9PcA5A/MrgL2DFapqL3AFQJJTgbVVtS/Jm4G3JvkQcCpwUpLZqlo/1H4jsBFgenq6ZmZmFvh0NKzX6+H+1FHrns0en0fYKKH/IHBeknPpn8FfCfzCYIUky4H/W1X/D7gOuA2gqv7xQJ2rgenhwJckHTnzjulX1X7gGuBe4HHgrqramWRDkku7ajPAriRP0L9oe8Nh6q8kaRFG+pXNqtoCbBkqu35gehOwaZ51fBb47Ng9lCRNjN/IlaSGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWrISKGf5OIku5LsTrJ+juUrk2xNsiNJL8mKgfKHkzySZGeSfz7pJyBJGt28oZ9kGXArcAmwGrgqyeqhajcDd1TVGmADcGNX/i3gZ6vqQuBNwPokr55U5yVJ4xnlTP8iYHdVPVlVzwN3ApcN1VkNbO2mtx1YXlXPV9VfdeUnj7g9SdJhcsIIdc4Gnh6Y30P/rH3QdmAt8GngcuC0JGdW1XeTnANsBl4DfKSq9g5vIMk6YB3A1NQUvV5v3Oehg5idnXV/6qjm8XlkjRL6maOshuavBW5JcjXwAPAMsB+gqp4G1nTDOp9Lsqmqvv2ilVVtBDYCTE9P18zMzDjPQYfQ6/Vwf2opJXNFyN94xyfnLq8ajhlNwijDLXuAcwbmVwAvOluvqr1VdUVVvR74WFe2b7gOsBN466J6LOmYUlUHfWzbtu2gy3R4jBL6DwLnJTk3yUnAlcDdgxWSLE9yYF3XAbd15SuSnNJNnwG8Bdg1qc5LksYzb+hX1X7gGuBe4HHgrqramWRDkku7ajPAriRPAFPADV35+cBXkmwHvgjcXFWPTvg5SJJGNMqYPlW1BdgyVHb9wPQmYNMc7e4H1iyyj5KkCfEjlJLUEENfkhpi6EtSQwx9SWqIoS9JDcnR9iWIJN8B/nSp+3EcWQ782VJ3QjoIj8/JWVlVZ81X6agLfU1Wkoeqanqp+yHNxePzyHN4R5IaYuhLUkMM/ePfxqXugHQIHp9H
mGP6ktQQz/QlqSGGvqSXSPLKJB9aYNtfTvIjk+6TJsPQXwILfUEl2ZLklYejT9KQVwILCn3gl4EjFvpJlh2pbR0PDP2lMecLar6Dt6p+rqr+4rD1akS+yJpwE/CTSR5J8qkkH0nyYJIdST4BkOQVSTYn2Z7ka0nen+TDwKuBbUm2zbXiJMuSfLZr82iSf9GVvybJH3br+59JfjJ9nxqo+/6u7kySbUl+B3i0K/tAkq92ff4Nj9ODONStzHwcngdwJ/Ac8Aj9O5NtA34HeKxb/jngYfq3l1w30O4p+t9gXEX/hjaf6ercB5xyiO19GHgM2AHc2ZWdCvwm/RfMDmBtV35VV/Y14JMD65gFNgBfAf4+8Eb6N8Z5mP4Ndl611PvVx0SP0VXA17rpd9P/lE3onyh+HngbsBb4zECb07t/nwKWH2LdbwTuH5h/ZffvV4DLu+mX0/9rYS1wP7CM/g2avgm8iv6Nm54Fzu3qnw/8AXBiN/+fgH+y1PvxaHwseQdafAy9oF508HZlP9b9e0oXvmd284Ohvx+4sCu/C/jAIba3Fzi5mz7wAvsk8B8G6pxB/wztm8BZ9G+w80fAe7vlBbyvmz4R+BJwVjf/fuC2pd6vPg7bMXpzd+w90j12A/8U+CngG92x9NaBtvOF/hnA14H/CFzcvZGcBuyZo+6/B35xYP6/AJd2r5ttA+XXdMf5gT7uAn51qffj0fgY6c5ZOuy+WlXfGJj/cJLLu+lzgPOA7w61+UZVPdJNP0z/RXowO4DfTvI5+n9FAPwD+vc7BqCq/jzJ24BeVX0HIMlv0z+j+xzwAvC7XfW/A7wWuD8J9M/CvjXaU9UxKMCNVfUbL1mQvBH4OeDGJPdV1Yb5VtYda68D/iHwS8D76F8HONi2D+bZoXq3V9V1822/dY7pHx3++uBNMkM/kN9cVa8D/oT+n7rD/mpg+gUOfevL9wC30v+z+uEkJ9B/kQx/SeNQL7AfVNULA/V2VtWF3eOCqnr3Idrq2PM9+mff0B+++8UkpwIkOTvJjyd5NfD9qvot+n8NvGGOti+RZDnwsqr6XeBfAW+oqr8E9iR5b1fn5O4TQA8A7++uA5xF/yTkq3Osdivw80l+vGv/Y0lWLmYHHK8M/aVxqBfF6cCfV9X3k/xd4GcWs6EkLwPOqaptwEfpX0Q+lf51gGsG6p1Bf0z17UmWdxfBrqI/bj9sF3BWkjd3bU9M8vcW008dXarqu8D/SPI14F30rzl9Ocmj9O+HfRpwAfDVJI8AHwP+Tdd8I/CFg13IBc4Gel27zwIHzs4/SP+v3B30hw//FvD79P9S3U5/uPGjVfW/5+jvY8C/BO7r2t9Pf+xfQ/xG7hLpPnWwhv4F3W9X1T/qyk+mP5xyNl240h+b7CV5CpimH9qfr6rXdm2uBU6tql+dYzsn0r9QfDr9M/TfqqqburO2A2f/LwCfqKrfS/IL9F+EAbZU1Ue79cxW1akD670Q+PVuvSfQvz7wmQnuIkmHgaEvSQ3xQq6kwybJV4CTh4o/WFWPLkV/5Jn+cSXJrcBbhoo/XVW/uRT9kXT0MfQlqSF+ekeSGmLoS1JDDH1JaoihL0kNMfQlqSH/HzYprOMsmJ4CAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x2563541a1d0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# %load solutions/04_solutions.py\n",
    "pipe = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000))\n",
    "param_grid = {'sgdclassifier__loss': ['hinge', 'log'],\n",
    "              'sgdclassifier__penalty': ['l2', 'l1']}\n",
    "grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1)\n",
    "scores = cross_validate(grid, X_breast, y_breast, scoring='balanced_accuracy', cv=3, return_train_score=True)\n",
    "df_scores = pd.DataFrame(scores)\n",
    "df_scores[['train_score', 'test_score']].boxplot()\n",
    "\n",
    "grid.fit(X_breast_train, y_breast_train)\n",
    "print(grid.best_params_)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Summary: my scikit-learn pipeline in less than 10 lines of code (skipping the import statements)\n",
    "## 5.总结：我的scikit-learn管道只有不到10行代码（跳过import语句）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x256353f2978>"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD9CAYAAABQvqc9AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFo1JREFUeJzt3X+wXOV93/H3x0JgYhHAiNzakiLRhLSoRib2Dbbj2r7xNBhMBgzK2ODaCU1n1I7NeNIZ7Ihpim0ljKAm0yaFpJGnClDHoYySeIiRDVTRmk79SyJGwkIVI2NihFzHMY7iCzgg8u0fe5SuVivdvT9070Xn/ZrZ0TnP85xznj06+9mzz9m9J1WFJKkdXjbXHZAkzR5DX5JaxNCXpBYx9CWpRQx9SWoRQ1+SWsTQl6QWMfQlqUUMfUlqkZMmapBkI/ALwF9V1WsG1Af4beCdwLPANVX1F03dLwO/3jT9zaq6Y6LtLV68uFasWDH0E9CxPfPMM7ziFa+Y625IA3l8zpyHHnror6vq7InaTRj6wO3ArcCdR6m/BDi3ebwB+D3gDUleCXwUGAUKeCjJPVX1/WNtbMWKFWzfvn2IbmkYnU6HsbGxue6GNJDH58xJ8pfDtJtweKeqHgSePkaTy4E7q+vLwBlJXgW8A3igqp5ugv4B4OJhOiVJOj5mYkx/CfBkz/y+puxo5ZKkOTLM8M5EMqCsjlF+5AqSNcAagJGRETqdzgx0SwDj4+PuT81bHp+zbyZCfx+wrGd+KbC/KR/rK+8MWkFVbQA2AIyOjpZjfDPHMVPNZx6fs28mhnfuAX4pXW8EDlTVt4H7gIuSnJnkTOCipkySNEeG+crmH9E9Y1+cZB/db+QsBKiq/wpspvt1zb10v7L5r5q6p5P8BrCtWdW6qjrWBWFJ0nE2YehX1dUT1BfwwaPUbQQ2Tq1rkqSZ5i9yJalFZuJCruaB7g+jJ8f7I0vt45n+CaKqBj6W/9pnj1onqX0MfUlqEUNfklrE0JekFjH0JalF/PbOS8xrP34/B557YVLLrFh776Tan37qQnZ89KJJLSPppcHQf4k58NwLPHHTpUO3n8rfNpnsm4Sklw6HdySpRQx9SWoRQ1+SWsTQl6QW8ULuS8xp563l/DvWTm6hOya7DYDhLxZLeukw9F9ifrD7Jr+9I2nKHN6RpBYx9CWpRQx9SWqRoUI/ycVJ9iTZm+SIq4hJlifZkmRnkk6SpT11Nyf5evN4z0x2XpI0OROGfpIFwG3AJcBK4OokK/ua3QLcWVWrgHXA+mbZS4HXARcAbwA+nORHZ677kqTJGOZM/0Jgb1U9XlXPA3cBl/e1WQlsaaa39tSvBL5QVQer6hlgB3Dx9LstSZqKYUJ/CfBkz/y+pqzXDmB1M30FcFqSs5ryS5L8SJLFwM8By6bXZUnSVA3zPf1Bd9zuv8HqdcCtSa4BHgSeAg5W1f1Jfgb4IvBd4EvAwSM2kKwB1gCMjIzQ6XSG7X8rTWb/jI+PT2l/+n+g2TDV41NTl4lukJ3kTcDHquodzfz1AFW1/ijtFwH/p6qWDqj7NPCpqtp8tO2Njo7W9u3bh38GLTMbP5zy7+lrtkzlx4MaLMlDVTU6UbthzvS3AecmOYfuGfxVwHv7NrYYeLqq/h64HtjYlC8Azqiq7yVZBawC7p/UM9FhJvNrXOi+SUx2GUknrglDv6oOJrkWuA9YAGysql1J1gHbq+oeYAxYn6ToDu98sFl8IfC/kgD8LfC+qjpieEeSNDuG+ts7zXDM5r6yG3qmNwGbBiz3Q7rf4JEkzQP+IleSWsTQl6QWMfQlqUUMfUlqEUNfklrE0JekFvF2iSeI5rcQg+tuHlw+0a+xJZ14PNM/QVTVwMfWrVuPWiepfQx9SWoRQ1+SWsTQl6QWMfQlqUUMfUlqEUNfklrE0JekFjH0JalFDH1JahFDX5JaZKjQT3Jxkj1J
9iZZO6B+eZItSXYm6SRZ2lP3H5PsSrI7ye/kWH8kRpJ0XE0Y+kkWALcBl9C93+3VSfrve3sLcGdVrQLWAeubZX8WeDOwCngN8DPA22as95KkSRnmTP9CYG9VPV5VzwN3AZf3tVkJbGmmt/bUF/By4GTgFGAh8J3pdlqSNDXDhP4S4Mme+X1NWa8dwOpm+grgtCRnVdWX6L4JfLt53FdVu6fXZUnSVA3z9/QHjcH3/13e64Bbk1wDPAg8BRxM8pPAecChMf4Hkry1qh48bAPJGmANwMjICJ1OZ+gnoGMbHx93f2re8vicfcOE/j5gWc/8UmB/b4Oq2g9cCZBkEbC6qg40Yf7lqhpv6j4HvJHuG0Pv8huADQCjo6M1NjY2pSejI3U6Hdyfmq88PmffMMM724Bzk5yT5GTgKuCe3gZJFic5tK7rgY3N9LeAtyU5KclCuhdxHd6RpDkyYehX1UHgWuA+uoF9d1XtSrIuyWVNszFgT5LHgBHgxqZ8E/AN4BG64/47qurPZvYpSJKGNdQ9cqtqM7C5r+yGnulNdAO+f7kXgX8zzT5KkmaIv8iVpBYx9CWpRQx9SWoRQ1+SWsTQl6QWMfQlqUUMfUlqEUNfklrE0JekFjH0JalFDH1JahFDX5JaxNCXpBYx9CWpRQx9SWoRQ1+SWsTQl6QWMfQlqUWGCv0kFyfZk2RvkrUD6pcn2ZJkZ5JOkqVN+c8lebjn8cMk75rpJyFJGs6EoZ9kAXAbcAmwErg6ycq+ZrcAd1bVKmAdsB6gqrZW1QVVdQHwduBZ4P4Z7L8kaRKGOdO/ENhbVY9X1fPAXcDlfW1WAlua6a0D6gF+EfhcVT071c5KkqZnmNBfAjzZM7+vKeu1A1jdTF8BnJbkrL42VwF/NJVOSpJmxklDtMmAsuqbvw64Nck1wIPAU8DBf1hB8irgfOC+gRtI1gBrAEZGRuh0OkN0S8MYHx93f2re8vicfcOE/j5gWc/8UmB/b4Oq2g9cCZBkEbC6qg70NHk38KdV9cKgDVTVBmADwOjoaI2NjQ3bf02g0+ng/tR85fE5+4YZ3tkGnJvknCQn0x2muae3QZLFSQ6t63pgY986rsahHUmacxOGflUdBK6lOzSzG7i7qnYlWZfksqbZGLAnyWPACHDjoeWTrKD7SeELM9pzSdKkDTO8Q1VtBjb3ld3QM70J2HSUZZ/gyAu/kqQ54C9yJalFDH1JahFDX5JaxNCXpBYx9CWpRQx9SWoRQ1+SWsTQl6QWMfQlqUUMfUlqEUNfklrE0JekFjH0JalFDH1JahFDX5JaxNCXpBYx9CWpRQx9SWqRoUI/ycVJ9iTZm2TtgPrlSbYk2Zmkk2RpT92PJ7k/ye4kjzb3zJUkzYEJQz/JAuA24BJgJXB1kpV9zW4B7qyqVcA6YH1P3Z3AJ6rqPOBC4K9mouOSpMkb5kz/QmBvVT1eVc8DdwGX97VZCWxpprceqm/eHE6qqgcAqmq8qp6dkZ5LkiZtmNBfAjzZM7+vKeu1A1jdTF8BnJbkLOCngL9J8idJvpbkE80nB0nSHDhpiDYZUFZ989cBtya5BngQeAo42Kz/LcBPA98C/gdwDfDfDttAsgZYAzAyMkKn0xm2/5rA+Pi4+1Pzlsfn7Bsm9PcBy3rmlwL7extU1X7gSoAki4DVVXUgyT7ga1X1eFP3GeCN9IV+VW0ANgCMjo7W2NjYlJ6MjtTpdHB/ar7y+Jx9wwzvbAPOTXJOkpOBq4B7ehskWZzk0LquBzb2LHtmkrOb+bcDj06/25KkqZgw9KvqIHAtcB+wG7i7qnYlWZfksqbZGLAnyWPACHBjs+yLdId+tiR5hO5Q0Sdn/FlIkoYyzPAOVbUZ2NxXdkPP9CZg01GWfQBYNY0+SpJmiL/IlaQWMfQlqUUMfUlqEUNfklrE0JekFjH0JalFDH1JahFDX5JaxNCXpBYx9CWpRQx9SWoRQ1+SWsTQl6QWMfQlqUUMfUlqEUNfklrE0Jek
FjH0JalFhgr9JBcn2ZNkb5K1A+qXJ9mSZGeSTpKlPXUvJnm4edzTv6wkafZMeI/cJAuA24CfB/YB25LcU1WP9jS7Bbizqu5I8nZgPfD+pu65qrpghvstSZqCYc70LwT2VtXjVfU8cBdweV+blcCWZnrrgHpJ0jwwTOgvAZ7smd/XlPXaAaxupq8ATktyVjP/8iTbk3w5ybum1VtJ0rRMOLwDZEBZ9c1fB9ya5BrgQeAp4GBT9+NVtT/JPwb+PMkjVfWNwzaQrAHWAIyMjNDpdIZ/Bjqm8fFx96fmLY/P2TdM6O8DlvXMLwX29zaoqv3AlQBJFgGrq+pATx1V9XiSDvDTwDf6lt8AbAAYHR2tsbGxKTwVDdLpdHB/ar7y+Jx9wwzvbAPOTXJOkpOBq4DDvoWTZHGSQ+u6HtjYlJ+Z5JRDbYA3A70XgCVJs2jC0K+qg8C1wH3AbuDuqtqVZF2Sy5pmY8CeJI8BI8CNTfl5wPYkO+he4L2p71s/kqRZNMzwDlW1GdjcV3ZDz/QmYNOA5b4InD/NPkqSZoi/yJWkFjH0JalFDH1JahFDX5JaZKgLuZI0Vcmg33dOrKr/N6CaCZ7pSzququqoj+W/9tmj1un4MPQlqUUMfUlqEUNfklrE0JekFjH0JalFDH1JahFDX5JaxNCXpBYx9CWpRQx9SWoRQ1+SWsTQl6QWGSr0k1ycZE+SvUnWDqhfnmRLkp1JOkmW9tX/aJKnktw6Ux2XJE3ehKGfZAFwG3AJsBK4OsnKvma3AHdW1SpgHbC+r/43gC9Mv7uSpOkY5kz/QmBvVT1eVc8DdwGX97VZCWxpprf21id5PTAC3D/97kqSpmOY0F8CPNkzv68p67UDWN1MXwGcluSsJC8Dfgv48HQ7KkmavmHunDXotjf9dzi4Drg1yTXAg8BTwEHgA8DmqnryWHfPSbIGWAMwMjJCp9MZolsaxvj4uPtT85rH5+waJvT3Act65pcC+3sbVNV+4EqAJIuA1VV1IMmbgLck+QCwCDg5yXhVre1bfgOwAWB0dLTGxsam+HTUr9Pp4P7UvPX5ez0+Z9kwob8NODfJOXTP4K8C3tvbIMli4Omq+nvgemAjQFX9y5421wCj/YEvSZo9E47pV9VB4FrgPmA3cHdV7UqyLsllTbMxYE+Sx+hetL3xOPVXkjQNw5zpU1Wbgc19ZTf0TG8CNk2wjtuB2yfdQ0nSjPEXuZLUIoa+JLXIUMM7kjSR1378fg4898Kkl1ux9t6h255+6kJ2fPSiSW9D/5+hL2lGHHjuBZ646dJJLTPZrxRP5g1Cgzm8I0ktYuhLUosY+pLUIoa+JLWIoS9JLWLoS1KLGPqS1CKGviS1iKEvSS1i6EtSixj6ktQihr4ktYihL0ktYuhLUosMFfpJLk6yJ8neJEfc2DzJ8iRbkuxM0kmytKf8oSQPJ9mV5N/O9BOQJA1vwtBPsgC4DbgEWAlcnWRlX7NbgDurahWwDljflH8b+NmqugB4A7A2yatnqvOSpMkZ5kz/QmBvVT1eVc8DdwGX97VZCWxpprceqq+q56vq75ryU4bcniTpOBkmhJcAT/bM72vKeu0AVjfTVwCnJTkLIMmyJDubddxcVfun12VJ0lQNc7vEDCirvvnrgFuTXAM8CDwFHASoqieBVc2wzmeSbKqq7xy2gWQNsAZgZGSETqczmeegYxgfH3d/atZM9libyvHp8Tw9w4T+PmBZz/xS4LCz9ebs/UqAJIuA1VV1oL9Nkl3AW4BNfXUbgA0Ao6OjNZl7ZurYJnsPUmnKPn/vpI+1SR+fU9iGDjfM8M424Nwk5yQ5GbgKuKe3QZLFSQ6t63pgY1O+NMmpzfSZwJuBPTPVeUnS5EwY+lV1ELgWuA/YDdxdVbuSrEtyWdNsDNiT5DFgBLixKT8P+EqSHcAXgFuq6pEZfg6SpCGlqn94fm6Njo7W9u3b57obJwyHdzRbzr/j/FnZziO/7HnjIEkeqqrRidoNM6Yv
SRP6we6beOKmSye1zGRPSlasvXeSvVI/vzcvSS1i6EtSixj6ktQijulLmjFTGnP//PDLnH7qwsmvX4cx9CXNiMlexIXum8RUltPUObwjSS1i6EtSixj6ktQihr4ktYgXciUdV8mgv87eU3/z4PL59idiThSe6Us6rqrqqI+tW7cetU7Hh6EvSS1i6EtSixj6ktQihr4ktYihL0ktYuhLUosY+pLUIoa+JLXIvLsxepLvAn851/04gSwG/nquOyEdhcfnzFleVWdP1Gjehb5mVpLtVTU61/2QBvH4nH0O70hSixj6ktQihv6Jb8Ncd0A6Bo/PWeaYviS1iGf6ktQihr6kIyQ5I8kHprjsryb5kZnuk2aGoT8HpvqCSrI5yRnHo09SnzOAKYU+8KvArIV+kgWzta0TgaE/Nwa+oCY6eKvqnVX1N8etV0PyRdYKNwE/keThJJ9I8uEk25LsTPJxgCSvSHJvkh1Jvp7kPUk+BLwa2Jpk66AVJ1mQ5PZmmUeS/Lum/CeT/M9mfX+R5CfS9Ymetu9p2o4l2Zrk08AjTdn7kny16fPve5wexbFuZebj+DyAu4DngIeBbcBW4NPAo039Z4CHgF3Amp7lnqD7C8YVwG7gk02b+4FTj7G9DwGPAjuBu5qyRcAf0H3B7ARWN+VXN2VfB27uWcc4sA74CvDPgdcDX2j6eR/wqrnerz5m9BhdAXy9mb6I7rdsQvdE8bPAW4HVwCd7ljm9+fcJYPEx1v164IGe+TOaf78CXNFMv5zup4XVwAPAAmAE+BbwKmAMeAY4p2l/HvBnwMJm/neBX5rr/TgfH3PegTY++l5Qhx28Tdkrm39PbcL3rGa+N/QPAhc05XcD7zvG9vYDpzTTh15gNwP/uafNmXTP0L4FnA2cBPw58K6mvoB3N9MLgS8CZzfz7wE2zvV+9XHcjtFbmmPv4eaxF/jXwE8B32yOpbf0LDtR6J8JfAP4L8DFzRvJacC+AW3/E/ArPfP/Hbised1s7Sm/tjnOD/VxD/Cxud6P8/FxEpoPvlpV3+yZ/1CSK5rpZcC5wPf6lvlmVT3cTD9E90V6NDuBP0zyGbqfIgD+BXDVoQZV9f0kbwU6VfVdgCR/SPeM7jPAi8AfN83/CfAa4IEk0D0L+/ZwT1UvQQHWV9XvH1GRvB54J7A+yf1VtW6ilTXH2muBdwAfBN5N9zrA0bZ9NM/0tbujqq6faPtt55j+/PAPB2+SMbqB/Kaqei3wNbofdfv9Xc/0i3DMN/BLgdvofqx+KMlJdF8k/T/SONYL7IdV9WJPu11VdUHzOL+qLjrGsnrp+QHds2/oDt/9SpJFAEmWJPmxJK8Gnq2qT9H9NPC6AcseIcli4GVV9cfAfwBeV1V/C+xL8q6mzSnNN4AeBN7TXAc4m+5JyFcHrHYL8ItJfqxZ/pVJlk9nB5yoDP25cawXxenA96vq2ST/FHjjdDaU5GXAsqraCnyE7kXkRXSvA1zb0+5MumOqb0uyuLkIdjXdcft+e4Czk7ypWXZhkn82nX5qfqmq7wH/O8nXgZ+ne83pS0keATbRPX7PB76a5GHg3wO/2Sy+Afjc0S7kAkuATrPc7cChs/P30/2Uu5Pu8OE/Av6U7ifVHXSHGz9SVf93QH8fBX4duL9Z/gG6Y//q4y9y50jzrYNVdC/ofqeqfqEpP4XucMoSmnClOzbZSfIEMEo3tD9bVa9plrkOWFRVHxuwnYV0LxSfTvcM/VNVdVNz1nbo7P9F4ONV9SdJ3kv3RRhgc1V9pFnPeFUt6lnvBcDvNOs9ie71gU/O4C6SdBwY+pLUIl7IlXTcJPkKcEpf8fur6pG56I880z+hJLkNeHNf8W9X1R/MRX8kzT+GviS1iN/ekaQWMfQlqUUMfUlqEUNfklrE0JekFvl/v5wK416soFMAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x256354007b8>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "pipe = make_pipeline(MinMaxScaler(),\n",
    "                     LogisticRegression(solver='saga', multi_class='auto', random_state=42, max_iter=5000))\n",
    "param_grid = {'logisticregression__C': [0.1, 1.0, 10],\n",
    "              'logisticregression__penalty': ['l2', 'l1']}\n",
    "grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1)\n",
    "scores = pd.DataFrame(cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True))\n",
    "scores[['train_score', 'test_score']].boxplot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6. Heterogeneous data: when you work with data other than numerical\n",
    "# 6. 异构数据：当您处理数值以外的数据时"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Up to now, we have used `scikit-learn` to train models on numerical data.\n",
    "\n",
    "到目前为止，我们使用`scikit-learn`在数值数据上训练模型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],\n",
       "       [ 0.,  0.,  0., ..., 10.,  0.,  0.],\n",
       "       [ 0.,  0.,  0., ..., 16.,  9.,  0.],\n",
       "       ...,\n",
       "       [ 0.,  0.,  1., ...,  6.,  0.,  0.],\n",
       "       [ 0.,  0.,  2., ..., 12.,  0.,  0.],\n",
       "       [ 0.,  0., 10., ..., 12.,  1.,  0.]])"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`X` is a NumPy array of `float` values only. However, datasets can contain mixed types.\n",
    "\n",
    "`X`是仅包含浮点值的`NumPy`数组。但是，数据集可以包含混合类型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pclass</th>\n",
       "      <th>survived</th>\n",
       "      <th>name</th>\n",
       "      <th>sex</th>\n",
       "      <th>age</th>\n",
       "      <th>sibsp</th>\n",
       "      <th>parch</th>\n",
       "      <th>ticket</th>\n",
       "      <th>fare</th>\n",
       "      <th>cabin</th>\n",
       "      <th>embarked</th>\n",
       "      <th>boat</th>\n",
       "      <th>body</th>\n",
       "      <th>home.dest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Allen, Miss. Elisabeth Walton</td>\n",
       "      <td>female</td>\n",
       "      <td>29.0000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>24160</td>\n",
       "      <td>211.3375</td>\n",
       "      <td>B5</td>\n",
       "      <td>S</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>St Louis, MO</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Allison, Master. Hudson Trevor</td>\n",
       "      <td>male</td>\n",
       "      <td>0.9167</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>11</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Miss. Helen Loraine</td>\n",
       "      <td>female</td>\n",
       "      <td>2.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Mr. Hudson Joshua Creighton</td>\n",
       "      <td>male</td>\n",
       "      <td>30.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>NaN</td>\n",
       "      <td>135.0</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Mrs. Hudson J C (Bessie Waldo Daniels)</td>\n",
       "      <td>female</td>\n",
       "      <td>25.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pclass  survived                                             name     sex  \\\n",
       "0       1         1                    Allen, Miss. Elisabeth Walton  female   \n",
       "1       1         1                   Allison, Master. Hudson Trevor    male   \n",
       "2       1         0                     Allison, Miss. Helen Loraine  female   \n",
       "3       1         0             Allison, Mr. Hudson Joshua Creighton    male   \n",
       "4       1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female   \n",
       "\n",
       "       age  sibsp  parch  ticket      fare    cabin embarked boat   body  \\\n",
       "0  29.0000      0      0   24160  211.3375       B5        S    2    NaN   \n",
       "1   0.9167      1      2  113781  151.5500  C22 C26        S   11    NaN   \n",
       "2   2.0000      1      2  113781  151.5500  C22 C26        S  NaN    NaN   \n",
       "3  30.0000      1      2  113781  151.5500  C22 C26        S  NaN  135.0   \n",
       "4  25.0000      1      2  113781  151.5500  C22 C26        S  NaN    NaN   \n",
       "\n",
       "                         home.dest  \n",
       "0                     St Louis, MO  \n",
       "1  Montreal, PQ / Chesterville, ON  \n",
       "2  Montreal, PQ / Chesterville, ON  \n",
       "3  Montreal, PQ / Chesterville, ON  \n",
       "4  Montreal, PQ / Chesterville, ON  "
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "data = pd.read_csv(os.path.join('data', 'titanic_openml.csv'), na_values='?')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `titanic` dataset contains categorical, text, and numeric features. We will use this dataset to predict whether a passenger survived the Titanic or not. \n",
    "\n",
    "Let's split the data into training and testing sets and use the `survived` column as a target.\n",
    "\n",
    "泰坦尼克号数据集包含分类、文本和数值特征。我们将使用此数据集来预测乘客是否在泰坦尼克号事故中幸存。\n",
    "\n",
    "让我们将数据拆分为训练和测试集，并将幸存列用作目标。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = data['survived']\n",
    "X = data.drop(columns='survived')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, one could try a `LogisticRegression` classifier and see how well it performs.\n",
    "\n",
    "首先，可以尝试使用`LogisticRegression`分类器，看看它的表现有多好。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n",
      "  FutureWarning)\n"
     ]
    },
    {
     "ename": "ValueError",
     "evalue": "could not convert string to float: 'S'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-53-45c060082178>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[0mclf\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mLogisticRegression\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mclf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;31m#这里肯定会报错。\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\logistic.py\u001b[0m in \u001b[0;36mfit\u001b[1;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[0;32m   1286\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m   1287\u001b[0m         X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype, order=\"C\",\n\u001b[1;32m-> 1288\u001b[1;33m                          accept_large_sparse=solver != 'liblinear')\n\u001b[0m\u001b[0;32m   1289\u001b[0m         \u001b[0mcheck_classification_targets\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m   1290\u001b[0m         \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclasses_\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0munique\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\utils\\validation.py\u001b[0m in \u001b[0;36mcheck_X_y\u001b[1;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)\u001b[0m\n\u001b[0;32m    754\u001b[0m                     \u001b[0mensure_min_features\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mensure_min_features\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    755\u001b[0m                     \u001b[0mwarn_on_dtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mwarn_on_dtype\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 756\u001b[1;33m                     estimator=estimator)\n\u001b[0m\u001b[0;32m    757\u001b[0m     \u001b[1;32mif\u001b[0m \u001b[0mmulti_output\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    758\u001b[0m         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,\n",
      "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\utils\\validation.py\u001b[0m in \u001b[0;36mcheck_array\u001b[1;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)\u001b[0m\n\u001b[0;32m    525\u001b[0m             \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    526\u001b[0m                 \u001b[0mwarnings\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msimplefilter\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'error'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mComplexWarning\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 527\u001b[1;33m                 \u001b[0marray\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0morder\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0morder\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    528\u001b[0m             \u001b[1;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    529\u001b[0m                 raise ValueError(\"Complex data not supported\\n\"\n",
      "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\numpy\\core\\numeric.py\u001b[0m in \u001b[0;36masarray\u001b[1;34m(a, dtype, order)\u001b[0m\n\u001b[0;32m    490\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    491\u001b[0m     \"\"\"\n\u001b[1;32m--> 492\u001b[1;33m     \u001b[1;32mreturn\u001b[0m \u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ma\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0morder\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0morder\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    493\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    494\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mValueError\u001b[0m: could not convert string to float: 'S'"
     ]
    }
   ],
   "source": [
    "clf = LogisticRegression()\n",
    "clf.fit(X_train, y_train)#这里肯定会报错。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whoops, most classifiers are designed to work with numerical data. Therefore, we need to convert the categorical data into numeric features. The simplest way is to one-hot encode each categorical feature with the `OneHotEncoder`. Let's give an example for the `sex` and `embarked` columns. Note that we also encounter some missing data. We will use a `SimpleImputer` to replace the missing values with a constant value.\n",
    "\n",
    "哎呀，大多数分类器都是为处理数值数据而设计的。因此，我们需要将分类数据转换为数字特征。最简单的方法是使用`OneHotEncoder`对每个分类特征进行独热编码。让我们以`sex`与`embarked`列为例。请注意，我们还会遇到一些缺失数据。我们将使用`SimpleImputer`用常量值替换缺失值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0., 1., 0., 0., 1., 0.],\n",
       "       [0., 1., 1., 0., 0., 0.],\n",
       "       [0., 1., 0., 0., 1., 0.],\n",
       "       ...,\n",
       "       [0., 1., 0., 0., 1., 0.],\n",
       "       [1., 0., 0., 0., 1., 0.],\n",
       "       [1., 0., 0., 0., 1., 0.]])"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "ohe = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder())\n",
    "X_encoded = ohe.fit_transform(X_train[['sex', 'embarked']])\n",
    "X_encoded.toarray()"
   ]
  },
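  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see concretely what this imputer-plus-encoder pipeline does, here is a minimal, self-contained sketch on a tiny synthetic column (not the Titanic data): the missing entry is filled with the constant string `'missing_value'` and then becomes a one-hot category of its own.\n",
    "\n",
    "为了直观看到这个“填充+编码”管道做了什么，下面用一个极小的合成列（并非泰坦尼克数据）做个最小示例：缺失值先被填充为常量字符串`'missing_value'`，随后成为它自己的独热类别。\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "df = pd.DataFrame({'sex': ['female', 'male', np.nan]})\n",
    "ohe = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder())\n",
    "encoded = ohe.fit_transform(df)\n",
    "# the learned categories include the constant fill value\n",
    "print(ohe.named_steps['onehotencoder'].categories_)\n",
    "print(encoded.shape)\n",
    "```"
   ]
  },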
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This way, it is possible to encode the categorical features. However, we also want to standardize the numerical features. Thus, we need to split the original data into 2 subgroups and apply a different preprocessing to each: (i) one-hot encoding for the categorical data and (ii) standard scaling for the numerical data. We also need to handle missing values in both cases. For the categorical columns, we replace the missing values by the constant string `'missing_value'`, which will be interpreted as a category on its own. For the numerical data, we will replace the missing values by the mean value of the feature of interest.\n",
    "\n",
    "这样就可以对分类特征进行编码了。但是，我们还希望对数值特征做标准化。因此，我们需要将原始数据分成两个子组并分别应用不同的预处理：（i）对分类数据做独热编码；（ii）对数值数据做标准缩放（标准化）。两种情况下我们都还需要处理缺失值：对于分类列，我们用常量字符串`'missing_value'`替换缺失值，该字符串会被当作一个单独的类别；对于数值数据，我们用相应特征的平均值替换缺失值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "col_cat = ['sex', 'embarked']\n",
    "col_num = ['age', 'sibsp', 'parch', 'fare']\n",
    "\n",
    "X_train_cat = X_train[col_cat]\n",
    "X_train_num = X_train[col_num]\n",
    "X_test_cat = X_test[col_cat]\n",
    "X_test_num = X_test[col_num]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "scaler_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder())\n",
    "X_train_cat_enc = scaler_cat.fit_transform(X_train_cat)\n",
    "X_test_cat_enc = scaler_cat.transform(X_test_cat)\n",
    "\n",
    "scaler_num = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())\n",
    "X_train_num_scaled = scaler_num.fit_transform(X_train_num)\n",
    "X_test_num_scaled = scaler_num.transform(X_test_num)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We should apply these transformations on the training and testing sets as we did in Sect. 2.1\n",
    "\n",
    "我们应该像在本文2.1中那样在训练和测试集上应用这些变换。"
   ]
  },
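  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`scipy.sparse.hstack` used below simply concatenates matrices column-wise while keeping the result sparse; here is a tiny self-contained sketch (unrelated to the Titanic data):\n",
    "\n",
    "下面用到的`scipy.sparse.hstack`只是按列拼接矩阵，并保持结果为稀疏格式；这是一个与泰坦尼克数据无关的最小示例：\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy import sparse\n",
    "\n",
    "a = sparse.csr_matrix(np.eye(2))  # shape (2, 2)\n",
    "b = np.array([[1.0], [2.0]])      # shape (2, 1)\n",
    "stacked = sparse.hstack((a, sparse.csr_matrix(b)))\n",
    "print(stacked.shape)  # (2, 3)\n",
    "```"
   ]
  },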
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy import sparse\n",
    "\n",
    "X_train_scaled = sparse.hstack((X_train_cat_enc,\n",
    "                                sparse.csr_matrix(X_train_num_scaled)))\n",
    "X_test_scaled = sparse.hstack((X_test_cat_enc,\n",
    "                               sparse.csr_matrix(X_test_num_scaled)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the transformation is done, we can combine the information, which is now entirely numerical. Finally, we use our `LogisticRegression` classifier as a model.\n",
    "\n",
    "转换完成后，我们就可以把这些信息组合起来，它们现在全部是数值了。最后，我们使用`LogisticRegression`分类器作为模型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of the LogisticRegression is 0.79\n"
     ]
    }
   ],
   "source": [
    "clf = LogisticRegression(solver='lbfgs')\n",
    "clf.fit(X_train_scaled, y_train)\n",
    "accuracy = clf.score(X_test_scaled, y_test)\n",
    "print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above pattern of first transforming the data and then fitting/scoring the classifier is exactly the pattern of Sect. 2.1. Therefore, we would like to use a pipeline for this purpose. However, we would also like to apply different processing to different columns of our matrix. The `ColumnTransformer` transformer (or the `make_column_transformer` helper function) should be used: it automatically applies a different pipeline to different columns.\n",
    "\n",
    "上面这种先转换数据、再拟合/评估分类器的模式，正是2.1节中的模式。因此，我们希望为此使用管道。但是，我们还希望对矩阵的不同列进行不同的处理。这时应使用`ColumnTransformer`转换器（或`make_column_transformer`辅助函数）：它会自动在不同的列上应用不同的管道。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score of the Pipeline is 0.79\n"
     ]
     }
   ],
   "source": [
    "from sklearn.compose import make_column_transformer\n",
    "\n",
    "pipe_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))\n",
    "pipe_num = make_pipeline(SimpleImputer(), StandardScaler())\n",
     "preprocessor = make_column_transformer((pipe_cat, col_cat), (pipe_num, col_num))\n",
    "\n",
    "pipe = make_pipeline(preprocessor, LogisticRegression(solver='lbfgs'))\n",
    "\n",
    "pipe.fit(X_train, y_train)\n",
    "accuracy = pipe.score(X_test, y_test)\n",
    "print('Accuracy score of the {} is {:.2f}'.format(pipe.__class__.__name__, accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides, it can also be used inside another pipeline. Thus, we will be able to use all the `scikit-learn` utilities, such as `cross_validate` or `GridSearchCV`.\n",
    "\n",
    "此外，它还可以用在另一个管道中。因此，我们将能够使用`cross_validate`或`GridSearchCV`等所有`scikit-learn`实用工具。"
   ]
  },
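  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For instance, a full pipeline can be passed directly to `cross_validate`. Here is a minimal, self-contained sketch (on the iris dataset rather than the Titanic data):\n",
    "\n",
    "例如，可以把整个管道直接传给`cross_validate`。下面是一个最小的自包含示例（使用iris数据集而非泰坦尼克数据）：\n",
    "\n",
    "```python\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "X_iris, y_iris = load_iris(return_X_y=True)\n",
    "pipe_iris = make_pipeline(StandardScaler(),\n",
    "                          LogisticRegression(solver='lbfgs', max_iter=1000))\n",
    "cv_results = cross_validate(pipe_iris, X_iris, y_iris, cv=5)\n",
    "print(cv_results['test_score'])  # one score per fold\n",
    "```"
   ]
  },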
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'columntransformer': ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,\n",
       "          transformer_weights=None,\n",
       "          transformers=[('pipeline-1', Pipeline(memory=None,\n",
       "      steps=[('simpleimputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,\n",
       "        strategy='constant', verbose=0)), ('onehotencoder', OneHotEncoder(categorical_features=None, categories=None,\n",
       "        dtype=<class 'numpy.float64'>, ha...r', StandardScaler(copy=True, with_mean=True, with_std=True))]), ['age', 'sibsp', 'parch', 'fare'])]),\n",
       " 'columntransformer__n_jobs': None,\n",
       " 'columntransformer__pipeline-1': Pipeline(memory=None,\n",
       "      steps=[('simpleimputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,\n",
       "        strategy='constant', verbose=0)), ('onehotencoder', OneHotEncoder(categorical_features=None, categories=None,\n",
       "        dtype=<class 'numpy.float64'>, handle_unknown='ignore',\n",
       "        n_values=None, sparse=True))]),\n",
       " 'columntransformer__pipeline-1__memory': None,\n",
       " 'columntransformer__pipeline-1__onehotencoder': OneHotEncoder(categorical_features=None, categories=None,\n",
       "        dtype=<class 'numpy.float64'>, handle_unknown='ignore',\n",
       "        n_values=None, sparse=True),\n",
       " 'columntransformer__pipeline-1__onehotencoder__categorical_features': None,\n",
       " 'columntransformer__pipeline-1__onehotencoder__categories': None,\n",
       " 'columntransformer__pipeline-1__onehotencoder__dtype': numpy.float64,\n",
       " 'columntransformer__pipeline-1__onehotencoder__handle_unknown': 'ignore',\n",
       " 'columntransformer__pipeline-1__onehotencoder__n_values': None,\n",
       " 'columntransformer__pipeline-1__onehotencoder__sparse': True,\n",
       " 'columntransformer__pipeline-1__simpleimputer': SimpleImputer(copy=True, fill_value=None, missing_values=nan,\n",
       "        strategy='constant', verbose=0),\n",
       " 'columntransformer__pipeline-1__simpleimputer__copy': True,\n",
       " 'columntransformer__pipeline-1__simpleimputer__fill_value': None,\n",
       " 'columntransformer__pipeline-1__simpleimputer__missing_values': nan,\n",
       " 'columntransformer__pipeline-1__simpleimputer__strategy': 'constant',\n",
       " 'columntransformer__pipeline-1__simpleimputer__verbose': 0,\n",
       " 'columntransformer__pipeline-1__steps': [('simpleimputer',\n",
       "   SimpleImputer(copy=True, fill_value=None, missing_values=nan,\n",
       "          strategy='constant', verbose=0)),\n",
       "  ('onehotencoder', OneHotEncoder(categorical_features=None, categories=None,\n",
       "          dtype=<class 'numpy.float64'>, handle_unknown='ignore',\n",
       "          n_values=None, sparse=True))],\n",
       " 'columntransformer__pipeline-2': Pipeline(memory=None,\n",
       "      steps=[('simpleimputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',\n",
       "        verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))]),\n",
       " 'columntransformer__pipeline-2__memory': None,\n",
       " 'columntransformer__pipeline-2__simpleimputer': SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',\n",
       "        verbose=0),\n",
       " 'columntransformer__pipeline-2__simpleimputer__copy': True,\n",
       " 'columntransformer__pipeline-2__simpleimputer__fill_value': None,\n",
       " 'columntransformer__pipeline-2__simpleimputer__missing_values': nan,\n",
       " 'columntransformer__pipeline-2__simpleimputer__strategy': 'mean',\n",
       " 'columntransformer__pipeline-2__simpleimputer__verbose': 0,\n",
       " 'columntransformer__pipeline-2__standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True),\n",
       " 'columntransformer__pipeline-2__standardscaler__copy': True,\n",
       " 'columntransformer__pipeline-2__standardscaler__with_mean': True,\n",
       " 'columntransformer__pipeline-2__standardscaler__with_std': True,\n",
       " 'columntransformer__pipeline-2__steps': [('simpleimputer',\n",
       "   SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',\n",
       "          verbose=0)),\n",
       "  ('standardscaler',\n",
       "   StandardScaler(copy=True, with_mean=True, with_std=True))],\n",
       " 'columntransformer__remainder': 'drop',\n",
       " 'columntransformer__sparse_threshold': 0.3,\n",
       " 'columntransformer__transformer_weights': None,\n",
       " 'columntransformer__transformers': [('pipeline-1', Pipeline(memory=None,\n",
       "        steps=[('simpleimputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,\n",
       "          strategy='constant', verbose=0)), ('onehotencoder', OneHotEncoder(categorical_features=None, categories=None,\n",
       "          dtype=<class 'numpy.float64'>, handle_unknown='ignore',\n",
       "          n_values=None, sparse=True))]), ['sex', 'embarked']),\n",
       "  ('pipeline-2', Pipeline(memory=None,\n",
       "        steps=[('simpleimputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',\n",
       "          verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))]), ['age',\n",
       "    'sibsp',\n",
       "    'parch',\n",
       "    'fare'])],\n",
       " 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "           intercept_scaling=1, max_iter=100, multi_class='warn',\n",
       "           n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',\n",
       "           tol=0.0001, verbose=0, warm_start=False),\n",
       " 'logisticregression__C': 1.0,\n",
       " 'logisticregression__class_weight': None,\n",
       " 'logisticregression__dual': False,\n",
       " 'logisticregression__fit_intercept': True,\n",
       " 'logisticregression__intercept_scaling': 1,\n",
       " 'logisticregression__max_iter': 100,\n",
       " 'logisticregression__multi_class': 'warn',\n",
       " 'logisticregression__n_jobs': None,\n",
       " 'logisticregression__penalty': 'l2',\n",
       " 'logisticregression__random_state': None,\n",
       " 'logisticregression__solver': 'lbfgs',\n",
       " 'logisticregression__tol': 0.0001,\n",
       " 'logisticregression__verbose': 0,\n",
       " 'logisticregression__warm_start': False,\n",
       " 'memory': None,\n",
       " 'steps': [('columntransformer',\n",
       "   ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,\n",
       "            transformer_weights=None,\n",
       "            transformers=[('pipeline-1', Pipeline(memory=None,\n",
       "        steps=[('simpleimputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,\n",
       "          strategy='constant', verbose=0)), ('onehotencoder', OneHotEncoder(categorical_features=None, categories=None,\n",
       "          dtype=<class 'numpy.float64'>, ha...r', StandardScaler(copy=True, with_mean=True, with_std=True))]), ['age', 'sibsp', 'parch', 'fare'])])),\n",
       "  ('logisticregression',\n",
       "   LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "             intercept_scaling=1, max_iter=100, multi_class='warn',\n",
       "             n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',\n",
       "             tol=0.0001, verbose=0, warm_start=False))]}"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe.get_params()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x25636651c88>"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYQAAAD9CAYAAAC85wBuAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAGGdJREFUeJzt3X+w3XWd3/Hny0DQ9Qc/BG8xsCS7DRaEGtc77Dq09mJXjLoj7Ohg0lFx1jFjKzpqVWBqEdM6i7Pb0t2d2NnYRVDQ1MVVUxMJdMm1joIm7EYgcaIxoGSi1VVYvf4Ayb77x/lkezw5N/d7f+UGeD5mvpPz/Xw/n8/5fE++97zO9/M9P1JVSJL0pIUegCTp6GAgSJIAA0GS1BgIkiTAQJAkNQaCJAkwECRJjYEgSQIMBElSc8xCD2A6Tj755Fq6dOlCD+Nx46c//SlPfepTF3oY0iE8NufWXXfd9XdVdcpU9R5TgbB06VK2b9++0MN43BgfH2dsbGyhhyEdwmNzbiX5dpd6ThlJkgADQZLUGAiSJMBAkCQ1BoIkCTAQJEmNgSBJAgwESVLzmPpgmqTHlyTTbuPvwM8fzxAkLZiqGrqccfnnJt2m+WMgSJIAA0GS1BgIkiSgYyAkWZlkd5I9Sa4Ysv3Xk2xN8rdJ7k7y8r5tV7Z2u5O8tGufkqQja8pASLIIWAe8DDgbWJ3k7IFq7wU+WVXPB1YBH2ptz27rzwVWAh9Ksqhjn5KkI6jLGcJ5wJ6q2ltVjwAbgIsG6hTwjHb7eGB/u30RsKGqHq6q+4A9rb8ufUqSjqAun0NYAjzQt74P+O2BOlcDtyZ5K/BU4Hf72t450HZJuz1VnwAkWQOsARgZGWF8fLzDkNXFxMSEj6eOWh6bR16XQBj2yZHBNwOvBq6vqv+S5IXAx5Kcc5i2w85Mhr7BuKrWA+sBRkdHy19Rmjv+KpWOWrds8thcAF0CYR9wet/6afz/KaGD3kjvGgFVdUeSJwMnT9F2qj4lSUdQl2sI24DlSZYlWUzvIvHGgTrfAf41QJKzgCcDP2j1ViU5LskyYDnw1Y59SpKOoCnPEKrq0SSXAVuARcB1VbUzyVpge1VtBP498OEk76A39fOG6n3GfGeSTwK7gEeBt1TVAYBhfc7D/kmSOur05XZVtRnYPFB2Vd/tXcD5k7T9APCBLn1KkhaOn1SWJAEGgiSpMRAkSYA/kCNpnj3v/bfy9z//5bTbLb1iU+e6xz/lWL72vgunfR/6VQaCpHn19z//Jfdf84pptZnuhyanEx6anFNGkiTAQJAkNQaCJAkwECRJjYEgSQIMBElS49tOH+eSYT9JMbXedxNKeiLxDOFxrqomXc64/HOTbpP0xGMgSJIAA0GS1BgIkiTAQJAkNZ0CIcnKJLuT7ElyxZDt1ybZ0ZZvJHmolV/QV74jyS+SXNy2XZ/kvr5tK+Z21yRJ0zHl206TLALWAS8B9gHbkmxsP5sJQFW9o6/+W4Hnt/KtwIpWfhKwB7i1r/t3V9XNc7AfT3jn3nDutNs8/Sw494ZD8v2w7rn0nmnfj6THhi6fQzgP2FNVewGSbAAuAnZNUn818L4h5a8GPl9VP5vJQHV4P/n6NX7FsKRZ6TJltAR4oG99Xys7RJIzgGXA7UM2rwI+MVD2gSR3tymn4zqMRZI0T7qcIQz7qOtkn1xaBdxcVQd+pYPkVOBcYEtf8ZXA94DFwHrgcmDtIXeerAHWAIyMjDA+Pt5hyE9M031sJiYmpt3Gx18z4bH52NAlEPYBp/etnwbsn6TuKuAtQ8ovAT5dVf/4O3pV9d128+EkHwHeNazDqlpPLzAYHR2t6UxxPKHcsmla0z8w/SmjmdyH5LH52NFlymgbsDzJsiSL6T3pbxyslOQ5wInAHUP6WM3AdFE7ayC9L9u5GLh3ekOXJM2lKc8QqurRJJfRm+5ZBFxXVTuTrAW2V9XB
cFgNbKiBL8JJspTeGcYXBrq+Kckp9KakdgBvns2OSJJmp9O3nVbVZmDzQNlVA+tXT9L2foZchK6qF3cdpCRp/vlJZUkSYCBIkhoDQZIEGAiSpMZAkCQBBoIkqTEQJEmAgSBJagwESRJgIEiSGgNBkgQYCJKkxkCQJAEGgiSpMRAkSYCBIElqOv1Ajh4bll6xafqNbune5vinHDv9/iU9ZnQKhCQrgT+h9xOa/6OqrhnYfi1wQVv9NeBZVXVC23YAuKdt+05VvbKVLwM2ACcBfwO8rqoemd3uPHHdf80rpt1m6RWbZtRO0uPTlFNGSRYB64CXAWcDq5Oc3V+nqt5RVSuqagXwZ8Bf9W3++cFtB8Og+SBwbVUtBx4E3jjLfZEkzUKXawjnAXuqam97Bb8BuOgw9VcDnzhch0kCvBi4uRXdAFzcYSySpHnSJRCWAA/0re9rZYdIcgawDLi9r/jJSbYnuTPJwSf9ZwIPVdWjU/UpSToyulxDyJCymqTuKuDmqjrQV/brVbU/yW8Atye5B/hx1z6TrAHWAIyMjDA+Pt5hyOrKx1NHwnSPs4mJiWm38VievS6BsA84vW/9NGD/JHVXAW/pL6iq/e3fvUnGgecDnwJOSHJMO0uYtM+qWg+sBxgdHa2xsbEOQ1Ynt2zCx1PzbgbH2fj4+PTaeCzPiS5TRtuA5UmWJVlM70l/42ClJM8BTgTu6Cs7Mclx7fbJwPnArqoqYCvw6lb1UuCzs9kRDZdk0uXbH/y9SbdJeuKZMhDaK/jLgC3A14FPVtXOJGuT9L9raDWwoT3ZH3QWsD3J1+gFwDVVtattuxx4Z5I99K4p/MXsd0eDqmrSZevWrZNuk/TE0+lzCFW1Gdg8UHbVwPrVQ9p9GTh3kj730nsHkyTpKOBXV0iSAANBktQYCJIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMgSJIAA0GS1BgIkiTAQJAkNQaCJAkwECRJjYEgSQIMBElSYyBIkgADQZLUdAqEJCuT7E6yJ8kVQ7Zfm2RHW76R5KFWviLJHUl2Jrk7yWv62lyf5L6+divmbrckSdM15W8qJ1kErANeAuwDtiXZWFW7Dtapqnf01X8r8Py2+jPg9VX1zSTPBu5KsqWqHmrb311VN8/RvkiSZqHLGcJ5wJ6q2ltVjwAbgIsOU3818AmAqvpGVX2z3d4PfB84ZXZDliTNhy6BsAR4oG99Xys7RJIzgGXA7UO2nQcsBr7VV/yBNpV0bZLjOo9akjTnppwyAjKkrCapuwq4uaoO/EoHyanAx4BLq+ofWvGVwPfohcR64HJg7SF3nqwB1gCMjIwwPj7eYcjqYmJiwsdTR8R0j7OZHJsey7PXJRD2Aaf3rZ8G7J+k7irgLf0FSZ4BbALeW1V3Hiyvqu+2mw8n+QjwrmEdVtV6eoHB6OhojY2NdRiyuhgfH8fHU/Pulk3TPs6mfWzO4D50qC5TRtuA5UmWJVlM70l/42ClJM8BTgTu6CtbDHwa+GhV/eVA/VPbvwEuBu6d6U5IkmZvyjOEqno0yWXAFmARcF1V7UyyFtheVQfDYTWwoar6p5MuAV4EPDPJG1rZG6pqB3BTklPoTUntAN48J3skSZqRLlNGVNVmYPNA2VUD61cPaXcjcOMkfb648yglSfPOTypLkgADQZLUGAiSJMBAkCQ1BoIkCTAQJEmNgSBJAgwESVJjIEiSAANBktQYCJIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMgSJKAjoGQZGWS3Un2JLliyPZrk+xoyzeSPNS37dIk32zLpX3lL0hyT+vzT9tvK0uSFsiUP6GZZBGwDngJsA/YlmRjVe06WKeq3tFX/63A89vtk4D3AaNAAXe1tg8C/x1YA9xJ7+c5VwKfn6P9kiRNU5czhPOAPVW1t6oeATYAFx2m/mrgE+32S4HbqupHLQRuA1YmORV4RlXdUVUFfBS4eMZ7IUmatS6BsAR4oG99Xys7RJIzgGXA
7VO0XdJuT9mnJOnImHLKCBg2t1+T1F0F3FxVB6Zo27nPJGvoTS0xMjLC+Pj4YQer7iYmJnw8dURM9zibybHpsTx7XQJhH3B63/ppwP5J6q4C3jLQdmyg7XgrP61Ln1W1HlgPMDo6WmNjY8OqaQbGx8fx8dS8u2XTtI+zaR+bM7gPHarLlNE2YHmSZUkW03vS3zhYKclzgBOBO/qKtwAXJjkxyYnAhcCWqvou8JMkv9PeXfR64LOz3BdJ0ixMeYZQVY8muYzek/si4Lqq2plkLbC9qg6Gw2pgQ7tIfLDtj5L8J3qhArC2qn7Ubv9b4HrgKfTeXeQ7jCRpAXWZMqKqNtN7a2h/2VUD61dP0vY64Loh5duBc7oOVJI0v/yksiQJMBAkSY2BIEkCDARJUmMgSJKAju8ykqSZevpZV3DuDYd8SfLUbpjOfQC8Yvr3oV9hIEiaVz/5+jXcf830nqyn+0nlpVdsmuaoNIxTRpIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMgSJIAP4cg6QiY0ecEbune5vinHDv9/nUIA0HSvJruh9KgFyAzaafZccpIkgQYCJKkplMgJFmZZHeSPUmGfktVkkuS7EqyM8nHW9kFSXb0Lb9IcnHbdn2S+/q2rZi73ZIkTdeU1xCSLALWAS8B9gHbkmysql19dZYDVwLnV9WDSZ4FUFVbgRWtzknAHuDWvu7fXVU3z9XOSJJmrssZwnnAnqraW1WPABuAiwbqvAlYV1UPAlTV94f082rg81X1s9kMWJI0P7q8y2gJ8EDf+j7gtwfqnAmQ5EvAIuDqqrploM4q4L8OlH0gyVXAXwNXVNXDg3eeZA2wBmBkZITx8fEOQ1YXExMTPp46anlsHnldAiFDympIP8uBMeA04ItJzqmqhwCSnAqcC2zpa3Ml8D1gMbAeuBxYe8gdVa1v2xkdHa3pfEe6Dm+63zkvHTG3bPLYXABdpoz2Aaf3rZ8G7B9S57NV9cuqug/YTS8gDroE+HRV/fJgQVV9t3oeBj5Cb2pKkrRAugTCNmB5kmVJFtOb+tk4UOczwAUASU6mN4W0t2/7auAT/Q3aWQNJAlwM3DuTHZAkzY0pp4yq6tEkl9Gb7lkEXFdVO5OsBbZX1ca27cIku4AD9N499EOAJEvpnWF8YaDrm5KcQm9Kagfw5rnZJUnSTHT66oqq2gxsHii7qu92Ae9sy2Db++ldmB4sf/E0xypJmkd+UlmSBBgIkqTGQJAkAQaCJKkxECRJgIEgSWoMBEkSYCBIkhoDQZIEGAiSpMZAkCQBBoIkqTEQJEmAgSBJagwESRJgIEiSGgNBkgR0DIQkK5PsTrInyRWT1Lkkya4kO5N8vK/8QJIdbdnYV74syVeSfDPJ/2y/1yxJWiBTBkKSRcA64GXA2cDqJGcP1FkOXAmcX1XPBd7et/nnVbWiLa/sK/8gcG1VLQceBN44u12RJM1GlzOE84A9VbW3qh4BNgAXDdR5E7Cuqh4EqKrvH67DJAFeDNzcim4ALp7OwCVJc6tLICwBHuhb39fK+p0JnJnkS0nuTLKyb9uTk2xv5Qef9J8JPFRVjx6mT0nSEXRMhzoZUlZD+lkOjAGnAV9Mck5VPQT8elXtT/IbwO1J7gF+3KHP3p0na4A1ACMjI4yPj3cYsrqYmJjw8dRRy2PzyOsSCPuA0/vWTwP2D6lzZ1X9ErgvyW56AbGtqvYDVNXeJOPA84FPASckOaadJQzrk9ZuPbAeYHR0tMbGxjrumqYyPj6Oj6eOSrds8thcAF2mjLYBy9u7ghYDq4CNA3U+A1wAkORkelNIe5OcmOS4vvLzgV1VVcBW4NWt/aXAZ2e7M5KkmZsyENor+MuALcDXgU9W1c4ka5McfNfQFuCHSXbRe6J/d1X9EDgL2J7ka638mqra1dpcDrwzyR561xT+Yi53TJI0PV2mjKiqzcDmgbKr+m4X8M629Nf5MnDuJH3upfcOJknSUcBPKkuSAANBktQY
CJIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMgSJIAA0GS1BgIkiTAQJAkNQaCJAkwECRJjYEgSQIMBElSYyBIkoCOgZBkZZLdSfYkuWKSOpck2ZVkZ5KPt7IVSe5oZXcneU1f/euT3JdkR1tWzM0uSZJmYsqf0EyyCFgHvATYB2xLsrHvt5FJshy4Eji/qh5M8qy26WfA66vqm0meDdyVZEtVPdS2v7uqbp7LHZIkzUyXM4TzgD1VtbeqHgE2ABcN1HkTsK6qHgSoqu+3f79RVd9st/cD3wdOmavBS5LmTpdAWAI80Le+r5X1OxM4M8mXktyZZOVgJ0nOAxYD3+or/kCbSro2yXHTHLskaQ5NOWUEZEhZDelnOTAGnAZ8Mck5B6eGkpwKfAy4tKr+obW5EvgevZBYD1wOrD3kzpM1wBqAkZERxsfHOwxZXUxMTPh46qjlsXnkdQmEfcDpfeunAfuH1Lmzqn4J3JdkN72A2JbkGcAm4L1VdefBBlX13Xbz4SQfAd417M6raj29wGB0dLTGxsY6DFldjI+P4+Opo9Itmzw2F0CXKaNtwPIky5IsBlYBGwfqfAa4ACDJyfSmkPa2+p8GPlpVf9nfoJ01kCTAxcC9s9kRSdLsTHmGUFWPJrkM2AIsAq6rqp1J1gLbq2pj23Zhkl3AAXrvHvphktcCLwKemeQNrcs3VNUO4KYkp9CbktoBvHmud06S1F2XKSOqajOweaDsqr7bBbyzLf11bgRunKTPF093sJKk+dMpECRpPvRmjCfZ9sHh5b3Xn5oPfnWFpAVTVUOXrVu3TrpN88dAkCQBBoIkqTEQJEmAgSBJagwESRJgIEiSGgNBkgQYCJKkJo+lD3ok+QHw7YUex+PIycDfLfQgpCE8NufWGVU15Y+TPaYCQXMryfaqGl3ocUiDPDYXhlNGkiTAQJAkNQbCE9v6hR6ANAmPzQXgNQRJEuAZgiSpMRAkTUuSE5L8uxm2fXuSX5vrMWluGAhHkZn+oSXZnOSE+RiTNMQJwIwCAXg7cMQCIcmiI3VfjwcGwtFl6B/aVAd1Vb28qh6at1F15B/fE8Y1wG8m2ZHkj5K8O8m2JHcneT9Akqcm2ZTka0nuTfKaJG8Dng1sTbJ1WMdJFiW5vrW5J8k7Wvk/TfK/W39/k+Q30/NHfXVf0+qOJdma5OPAPa3stUm+2sb85x6rk5jsZ+pcjvwCbAB+DuwAtgFbgY8Du9r2zwB3ATuBNX3t7qf3yc6lwNeBD7c6twJPOcz9vQ3YBdwNbGhlTwM+Qu8P6W7gVa18dSu7F/hgXx8TwFrgK8C/AF4AfKGNcwtw6kI/ri5zfpwuBe5tty+k946g0HuB+TngRcCrgA/3tTm+/Xs/cPJh+n4BcFvf+gnt368Av99uP5neWcargNuARcAI8B3gVGAM+CmwrNU/C/hfwLFt/UPA6xf6cTwalwUfgEvff8av/qH9ykHdyk5q/z6lPTE/s633B8KjwIpW/kngtYe5v/3Ace32wT+8DwL/ra/OifRe1X0HOAU4BrgduLhtL+CSdvtY4MvAKW39NcB1C/24uszrcfrH7fjb0ZY9wBuBM4H72vH0L/vaThUIJwLfAv4MWNlC5unAviF1rwX+oG/9Y8Ar29/O1r7yy9qxfnCMu4GrF/pxPBqXY9DR7KtVdV/f+tuS/H67fTqwHPjhQJv7qmpHu30XvT/eydwN3JTkM/TOPgB+F1h1sEJVPZjkRcB4Vf0AIMlN9F4FfgY4AHyqVX8OcA5wWxLovXL7brdd1WNUgD+sqj8/ZEPyAuDlwB8mubWq1k7VWTvenge8FHgLcAm96w6T3fdkfjpQ74aqunKq+3+i8xrC0e0fD+okY/SerF9YVc8D/pbeqfOgh/tuH4DDhv4rgHX0TtPvSnIMvT+ewQ+nHO4P7xdVdaCv3s6qWtGWc6vqwsO01WPTT+i9aofetOAfJHkaQJIlSZ6V5NnAz6rqRnpnEb81pO0hkpwMPKmq
PgX8R+C3qurHwL4kF7c6x7V3Kv0f4DXtusMp9F6kfHVIt38NvDrJs1r7k5KcMZsH4PHKQDi6HO6P5Xjgwar6WZJ/BvzObO4oyZOA06tqK/Aeehe0n0bvusNlffVOpDd/+6+SnNwuxq2md51g0G7glCQvbG2PTfLc2YxTR5+q+iHwpST3Ai+hd53rjiT3ADfTO4bPBb6aZAfwH4D/3JqvBz4/2UVlYAkw3tpdDxx8Vf86emfId9OblvwnwKfpneV+jd405nuq6ntDxrsLeC9wa2t/G71rDRrgJ5WPMu2dEf+c3sXl/1tVv9fKj6M3RbOE9sRLbx50PMn9wCi9J/TPVdU5rc27gKdV1dVD7udYehetj6f3yv7GqrqmvdI7eNZwAHh/Vf1Vkn9D748zwOaqek/rZ6KqntbX7wrgT1u/x9C7HvHhOXyIJM0TA0GSBBx+flmS5k2SrwDHDRS/rqruWYjxyDOEJ4Qk64DzB4r/pKo+shDjkXR0MhAkSYDvMpIkNQaCJAkwECRJjYEgSQIMBElS8/8AL+1vDMN6PQ0AAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x256366397f0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Categorical branch: impute missing values with a constant, then one-hot\n",
    "# encode, ignoring categories unseen at fit time\n",
    "pipe_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))\n",
    "# Numerical branch: standardize, then impute (the strategy is tuned below)\n",
    "pipe_num = make_pipeline(StandardScaler(), SimpleImputer())\n",
    "preprocessor = make_column_transformer((col_cat, pipe_cat), (col_num, pipe_num))\n",
    "\n",
    "pipe = make_pipeline(preprocessor, LogisticRegression(solver='lbfgs'))\n",
    "\n",
    "# Nested parameter names follow the <step>__<param> convention\n",
    "param_grid = {'columntransformer__pipeline-2__simpleimputer__strategy': ['mean', 'median'],\n",
    "              'logisticregression__C': [0.1, 1.0, 10]}\n",
    "grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)\n",
    "# Evaluate the tuned pipeline with an outer cross-validation\n",
    "scores = pd.DataFrame(cross_validate(grid, X, y, scoring='balanced_accuracy', cv=5, n_jobs=-1, return_train_score=True))\n",
    "scores[['train_score', 'test_score']].boxplot()"
   ]
  },
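  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The nested parameter names used above, such as `columntransformer__pipeline-2__simpleimputer__strategy`, are built by joining the lowercased step names with double underscores; `pipeline-2` refers to the second `Pipeline` passed to the `ColumnTransformer`. When in doubt, the exact names can be listed from the pipeline itself (a quick check, assuming the `pipe` defined above):\n",
    "\n",
    "```python\n",
    "# Print every tunable parameter name of the full pipeline;\n",
    "# any of these keys can be used in param_grid.\n",
    "for name in sorted(pipe.get_params()):\n",
    "    print(name)\n",
    "```\n",
    "\n",
    "上面使用的嵌套参数名（如`columntransformer__pipeline-2__simpleimputer__strategy`）由小写的步骤名用双下划线连接而成；`pipeline-2`指传给`ColumnTransformer`的第二个`Pipeline`。不确定参数名时，可以用`pipe.get_params()`列出全部可调参数名（假设使用上面定义的`pipe`）。"
   ]
  },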
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Do the following exercise:\n",
    "\n",
    "Load the adult dataset located in `./data/adult_openml.csv`. Make your own `ColumnTransformer` preprocessor. Pipeline it with a classifier. Fine-tune it and check the prediction accuracy with cross-validation.\n",
    "\n",
    "#### 练习\n",
    "完成接下来的练习：\n",
    "\n",
    "加载位于`./data/adult_openml.csv`中的成人数据集。 制作自己的`ColumnTransformer`预处理器，并用分类器管道化它。对其进行微调并在交叉验证中检查预测准确性。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Read the adult dataset located in `./data/adult_openml.csv` using `pd.read_csv`.\n",
    "\n",
    "* 使用`pd.read_csv`读取位于`./data/adult_openml.csv`中的成人数据集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/05_1_solutions.py\n",
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "data = pd.read_csv(os.path.join('data', 'adult_openml.csv'))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>workclass</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>education</th>\n",
       "      <th>education-num</th>\n",
       "      <th>marital-status</th>\n",
       "      <th>occupation</th>\n",
       "      <th>relationship</th>\n",
       "      <th>race</th>\n",
       "      <th>sex</th>\n",
       "      <th>capitalgain</th>\n",
       "      <th>capitalloss</th>\n",
       "      <th>hoursperweek</th>\n",
       "      <th>native-country</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>State-gov</td>\n",
       "      <td>77516</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>83311</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>215646</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>234721</td>\n",
       "      <td>11th</td>\n",
       "      <td>7</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>338409</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Wife</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Cuba</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>284582</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Wife</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>160187</td>\n",
       "      <td>9th</td>\n",
       "      <td>5</td>\n",
       "      <td>Married-spouse-absent</td>\n",
       "      <td>Other-service</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Jamaica</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>3</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>209642</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>45781</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>159449</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>280464</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>1</td>\n",
       "      <td>State-gov</td>\n",
       "      <td>141297</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Asian-Pac-Islander</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>India</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>0</td>\n",
       "      <td>Private</td>\n",
       "      <td>122272</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>205019</td>\n",
       "      <td>Assoc-acdm</td>\n",
       "      <td>12</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Sales</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>121772</td>\n",
       "      <td>Assoc-voc</td>\n",
       "      <td>11</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Asian-Pac-Islander</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>245487</td>\n",
       "      <td>7th-8th</td>\n",
       "      <td>4</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Transport-moving</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Amer-Indian-Eskimo</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Mexico</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>0</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>176756</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Farming-fishing</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>186824</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Machine-op-inspct</td>\n",
       "      <td>Unmarried</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>28887</td>\n",
       "      <td>11th</td>\n",
       "      <td>7</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Sales</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>2</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>292175</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Unmarried</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>193524</td>\n",
       "      <td>Doctorate</td>\n",
       "      <td>16</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>302146</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Separated</td>\n",
       "      <td>Other-service</td>\n",
       "      <td>Unmarried</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>1</td>\n",
       "      <td>Federal-gov</td>\n",
       "      <td>76845</td>\n",
       "      <td>9th</td>\n",
       "      <td>5</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Farming-fishing</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>117037</td>\n",
       "      <td>11th</td>\n",
       "      <td>7</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Transport-moving</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>4</td>\n",
       "      <td>Private</td>\n",
       "      <td>109015</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Tech-support</td>\n",
       "      <td>Unmarried</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>4</td>\n",
       "      <td>Local-gov</td>\n",
       "      <td>216851</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Tech-support</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>0</td>\n",
       "      <td>Private</td>\n",
       "      <td>168294</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>3</td>\n",
       "      <td>?</td>\n",
       "      <td>180211</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>?</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Asian-Pac-Islander</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>South</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>367260</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>193366</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48812</th>\n",
       "      <td>4</td>\n",
       "      <td>?</td>\n",
       "      <td>26711</td>\n",
       "      <td>Assoc-voc</td>\n",
       "      <td>11</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>?</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48813</th>\n",
       "      <td>4</td>\n",
       "      <td>Private</td>\n",
       "      <td>117909</td>\n",
       "      <td>Assoc-voc</td>\n",
       "      <td>11</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48814</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>229647</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Tech-support</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48815</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>149347</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48816</th>\n",
       "      <td>2</td>\n",
       "      <td>Local-gov</td>\n",
       "      <td>23157</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48817</th>\n",
       "      <td>0</td>\n",
       "      <td>Private</td>\n",
       "      <td>93977</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Machine-op-inspct</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48818</th>\n",
       "      <td>4</td>\n",
       "      <td>Self-emp-inc</td>\n",
       "      <td>159691</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48819</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>176967</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Protective-serv</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48820</th>\n",
       "      <td>4</td>\n",
       "      <td>Private</td>\n",
       "      <td>344436</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Widowed</td>\n",
       "      <td>Sales</td>\n",
       "      <td>Other-relative</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48821</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>430340</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Sales</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48822</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>202168</td>\n",
       "      <td>Prof-school</td>\n",
       "      <td>15</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48823</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>82720</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48824</th>\n",
       "      <td>0</td>\n",
       "      <td>Private</td>\n",
       "      <td>269623</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48825</th>\n",
       "      <td>4</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>136405</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Widowed</td>\n",
       "      <td>Farming-fishing</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48826</th>\n",
       "      <td>3</td>\n",
       "      <td>Local-gov</td>\n",
       "      <td>139347</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Wife</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48827</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>224655</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Separated</td>\n",
       "      <td>Priv-house-serv</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48828</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>247547</td>\n",
       "      <td>Assoc-voc</td>\n",
       "      <td>11</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Unmarried</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48829</th>\n",
       "      <td>4</td>\n",
       "      <td>Private</td>\n",
       "      <td>292710</td>\n",
       "      <td>Assoc-acdm</td>\n",
       "      <td>12</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48830</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>173449</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48831</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>285570</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48832</th>\n",
       "      <td>4</td>\n",
       "      <td>Private</td>\n",
       "      <td>89686</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Sales</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48833</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>440129</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48834</th>\n",
       "      <td>0</td>\n",
       "      <td>Private</td>\n",
       "      <td>350977</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Other-service</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48835</th>\n",
       "      <td>3</td>\n",
       "      <td>Local-gov</td>\n",
       "      <td>349230</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Other-service</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48836</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>245211</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48837</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>215419</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48838</th>\n",
       "      <td>4</td>\n",
       "      <td>?</td>\n",
       "      <td>321403</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Widowed</td>\n",
       "      <td>?</td>\n",
       "      <td>Other-relative</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48839</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>374983</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48840</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>83891</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>Asian-Pac-Islander</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48841</th>\n",
       "      <td>1</td>\n",
       "      <td>Self-emp-inc</td>\n",
       "      <td>182148</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;50K</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>48842 rows × 15 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       age         workclass  fnlwgt     education  education-num  \\\n",
       "0        2         State-gov   77516     Bachelors             13   \n",
       "1        3  Self-emp-not-inc   83311     Bachelors             13   \n",
       "2        2           Private  215646       HS-grad              9   \n",
       "3        3           Private  234721          11th              7   \n",
       "4        1           Private  338409     Bachelors             13   \n",
       "5        2           Private  284582       Masters             14   \n",
       "6        3           Private  160187           9th              5   \n",
       "7        3  Self-emp-not-inc  209642       HS-grad              9   \n",
       "8        1           Private   45781       Masters             14   \n",
       "9        2           Private  159449     Bachelors             13   \n",
       "10       2           Private  280464  Some-college             10   \n",
       "11       1         State-gov  141297     Bachelors             13   \n",
       "12       0           Private  122272     Bachelors             13   \n",
       "13       1           Private  205019    Assoc-acdm             12   \n",
       "14       2           Private  121772     Assoc-voc             11   \n",
       "15       1           Private  245487       7th-8th              4   \n",
       "16       0  Self-emp-not-inc  176756       HS-grad              9   \n",
       "17       1           Private  186824       HS-grad              9   \n",
       "18       2           Private   28887          11th              7   \n",
       "19       2  Self-emp-not-inc  292175       Masters             14   \n",
       "20       2           Private  193524     Doctorate             16   \n",
       "21       3           Private  302146       HS-grad              9   \n",
       "22       1       Federal-gov   76845           9th              5   \n",
       "23       2           Private  117037          11th              7   \n",
       "24       4           Private  109015       HS-grad              9   \n",
       "25       4         Local-gov  216851     Bachelors             13   \n",
       "26       0           Private  168294       HS-grad              9   \n",
       "27       3                 ?  180211  Some-college             10   \n",
       "28       2           Private  367260       HS-grad              9   \n",
       "29       3           Private  193366       HS-grad              9   \n",
       "...    ...               ...     ...           ...            ...   \n",
       "48812    4                 ?   26711     Assoc-voc             11   \n",
       "48813    4           Private  117909     Assoc-voc             11   \n",
       "48814    2           Private  229647     Bachelors             13   \n",
       "48815    2           Private  149347       Masters             14   \n",
       "48816    2         Local-gov   23157       Masters             14   \n",
       "48817    0           Private   93977       HS-grad              9   \n",
       "48818    4      Self-emp-inc  159691  Some-college             10   \n",
       "48819    1           Private  176967  Some-college             10   \n",
       "48820    4           Private  344436       HS-grad              9   \n",
       "48821    1           Private  430340  Some-college             10   \n",
       "48822    2           Private  202168   Prof-school             15   \n",
       "48823    3           Private   82720       HS-grad              9   \n",
       "48824    0           Private  269623  Some-college             10   \n",
       "48825    4  Self-emp-not-inc  136405       HS-grad              9   \n",
       "48826    3         Local-gov  139347       Masters             14   \n",
       "48827    3           Private  224655       HS-grad              9   \n",
       "48828    2           Private  247547     Assoc-voc             11   \n",
       "48829    4           Private  292710    Assoc-acdm             12   \n",
       "48830    1           Private  173449       HS-grad              9   \n",
       "48831    3           Private  285570       HS-grad              9   \n",
       "48832    4           Private   89686       HS-grad              9   \n",
       "48833    1           Private  440129       HS-grad              9   \n",
       "48834    0           Private  350977       HS-grad              9   \n",
       "48835    3         Local-gov  349230       Masters             14   \n",
       "48836    1           Private  245211     Bachelors             13   \n",
       "48837    2           Private  215419     Bachelors             13   \n",
       "48838    4                 ?  321403       HS-grad              9   \n",
       "48839    2           Private  374983     Bachelors             13   \n",
       "48840    2           Private   83891     Bachelors             13   \n",
       "48841    1      Self-emp-inc  182148     Bachelors             13   \n",
       "\n",
       "              marital-status         occupation    relationship  \\\n",
       "0              Never-married       Adm-clerical   Not-in-family   \n",
       "1         Married-civ-spouse    Exec-managerial         Husband   \n",
       "2                   Divorced  Handlers-cleaners   Not-in-family   \n",
       "3         Married-civ-spouse  Handlers-cleaners         Husband   \n",
       "4         Married-civ-spouse     Prof-specialty            Wife   \n",
       "5         Married-civ-spouse    Exec-managerial            Wife   \n",
       "6      Married-spouse-absent      Other-service   Not-in-family   \n",
       "7         Married-civ-spouse    Exec-managerial         Husband   \n",
       "8              Never-married     Prof-specialty   Not-in-family   \n",
       "9         Married-civ-spouse    Exec-managerial         Husband   \n",
       "10        Married-civ-spouse    Exec-managerial         Husband   \n",
       "11        Married-civ-spouse     Prof-specialty         Husband   \n",
       "12             Never-married       Adm-clerical       Own-child   \n",
       "13             Never-married              Sales   Not-in-family   \n",
       "14        Married-civ-spouse       Craft-repair         Husband   \n",
       "15        Married-civ-spouse   Transport-moving         Husband   \n",
       "16             Never-married    Farming-fishing       Own-child   \n",
       "17             Never-married  Machine-op-inspct       Unmarried   \n",
       "18        Married-civ-spouse              Sales         Husband   \n",
       "19                  Divorced    Exec-managerial       Unmarried   \n",
       "20        Married-civ-spouse     Prof-specialty         Husband   \n",
       "21                 Separated      Other-service       Unmarried   \n",
       "22        Married-civ-spouse    Farming-fishing         Husband   \n",
       "23        Married-civ-spouse   Transport-moving         Husband   \n",
       "24                  Divorced       Tech-support       Unmarried   \n",
       "25        Married-civ-spouse       Tech-support         Husband   \n",
       "26             Never-married       Craft-repair       Own-child   \n",
       "27        Married-civ-spouse                  ?         Husband   \n",
       "28                  Divorced    Exec-managerial   Not-in-family   \n",
       "29        Married-civ-spouse       Craft-repair         Husband   \n",
       "...                      ...                ...             ...   \n",
       "48812     Married-civ-spouse                  ?         Husband   \n",
       "48813     Married-civ-spouse     Prof-specialty         Husband   \n",
       "48814          Never-married       Tech-support   Not-in-family   \n",
       "48815     Married-civ-spouse     Prof-specialty         Husband   \n",
       "48816     Married-civ-spouse    Exec-managerial         Husband   \n",
       "48817          Never-married  Machine-op-inspct       Own-child   \n",
       "48818               Divorced    Exec-managerial   Not-in-family   \n",
       "48819     Married-civ-spouse    Protective-serv         Husband   \n",
       "48820                Widowed              Sales  Other-relative   \n",
       "48821          Never-married              Sales   Not-in-family   \n",
       "48822     Married-civ-spouse     Prof-specialty         Husband   \n",
       "48823     Married-civ-spouse       Craft-repair         Husband   \n",
       "48824          Never-married       Craft-repair       Own-child   \n",
       "48825                Widowed    Farming-fishing   Not-in-family   \n",
       "48826     Married-civ-spouse     Prof-specialty            Wife   \n",
       "48827              Separated    Priv-house-serv   Not-in-family   \n",
       "48828          Never-married       Adm-clerical       Unmarried   \n",
       "48829               Divorced     Prof-specialty   Not-in-family   \n",
       "48830     Married-civ-spouse  Handlers-cleaners         Husband   \n",
       "48831     Married-civ-spouse       Adm-clerical         Husband   \n",
       "48832     Married-civ-spouse              Sales         Husband   \n",
       "48833     Married-civ-spouse       Craft-repair         Husband   \n",
       "48834          Never-married      Other-service       Own-child   \n",
       "48835               Divorced      Other-service   Not-in-family   \n",
       "48836          Never-married     Prof-specialty       Own-child   \n",
       "48837               Divorced     Prof-specialty   Not-in-family   \n",
       "48838                Widowed                  ?  Other-relative   \n",
       "48839     Married-civ-spouse     Prof-specialty         Husband   \n",
       "48840               Divorced       Adm-clerical       Own-child   \n",
       "48841     Married-civ-spouse    Exec-managerial         Husband   \n",
       "\n",
       "                     race     sex  capitalgain  capitalloss  hoursperweek  \\\n",
       "0                   White    Male            1            0             2   \n",
       "1                   White    Male            0            0             0   \n",
       "2                   White    Male            0            0             2   \n",
       "3                   Black    Male            0            0             2   \n",
       "4                   Black  Female            0            0             2   \n",
       "5                   White  Female            0            0             2   \n",
       "6                   Black  Female            0            0             0   \n",
       "7                   White    Male            0            0             2   \n",
       "8                   White  Female            4            0             3   \n",
       "9                   White    Male            2            0             2   \n",
       "10                  Black    Male            0            0             4   \n",
       "11     Asian-Pac-Islander    Male            0            0             2   \n",
       "12                  White  Female            0            0             1   \n",
       "13                  Black    Male            0            0             3   \n",
       "14     Asian-Pac-Islander    Male            0            0             2   \n",
       "15     Amer-Indian-Eskimo    Male            0            0             2   \n",
       "16                  White    Male            0            0             1   \n",
       "17                  White    Male            0            0             2   \n",
       "18                  White    Male            0            0             3   \n",
       "19                  White  Female            0            0             2   \n",
       "20                  White    Male            0            0             3   \n",
       "21                  Black  Female            0            0             0   \n",
       "22                  Black    Male            0            0             2   \n",
       "23                  White    Male            0            3             2   \n",
       "24                  White  Female            0            0             2   \n",
       "25                  White    Male            0            0             2   \n",
       "26                  White    Male            0            0             2   \n",
       "27     Asian-Pac-Islander    Male            0            0             3   \n",
       "28                  White    Male            0            0             4   \n",
       "29                  White    Male            0            0             2   \n",
       "...                   ...     ...          ...          ...           ...   \n",
       "48812               White    Male            1            0             0   \n",
       "48813               White    Male            3            0             2   \n",
       "48814               White  Female            0            2             2   \n",
       "48815               White    Male            0            0             3   \n",
       "48816               White    Male            0            3             3   \n",
       "48817               White    Male            0            0             2   \n",
       "48818               White  Female            0            0             2   \n",
       "48819               White    Male            0            0             2   \n",
       "48820               White  Female            0            0             0   \n",
       "48821               White  Female            0            0             2   \n",
       "48822               White    Male            4            0             3   \n",
       "48823               White    Male            0            0             2   \n",
       "48824               White    Male            0            0             2   \n",
       "48825               White    Male            0            0             1   \n",
       "48826               White  Female            0            0             2   \n",
       "48827               White  Female            0            0             1   \n",
       "48828               Black  Female            0            0             2   \n",
       "48829               White    Male            0            0             2   \n",
       "48830               White    Male            0            0             2   \n",
       "48831               White    Male            0            0             2   \n",
       "48832               White    Male            0            0             3   \n",
       "48833               White    Male            0            0             2   \n",
       "48834               White  Female            0            0             2   \n",
       "48835               White    Male            0            0             2   \n",
       "48836               White    Male            0            0             2   \n",
       "48837               White  Female            0            0             2   \n",
       "48838               Black    Male            0            0             2   \n",
       "48839               White    Male            0            0             3   \n",
       "48840  Asian-Pac-Islander    Male            2            0             2   \n",
       "48841               White    Male            0            0             3   \n",
       "\n",
       "      native-country  class  \n",
       "0      United-States  <=50K  \n",
       "1      United-States  <=50K  \n",
       "2      United-States  <=50K  \n",
       "3      United-States  <=50K  \n",
       "4               Cuba  <=50K  \n",
       "5      United-States  <=50K  \n",
       "6            Jamaica  <=50K  \n",
       "7      United-States   >50K  \n",
       "8      United-States   >50K  \n",
       "9      United-States   >50K  \n",
       "10     United-States   >50K  \n",
       "11             India   >50K  \n",
       "12     United-States  <=50K  \n",
       "13     United-States  <=50K  \n",
       "14                 ?   >50K  \n",
       "15            Mexico  <=50K  \n",
       "16     United-States  <=50K  \n",
       "17     United-States  <=50K  \n",
       "18     United-States  <=50K  \n",
       "19     United-States   >50K  \n",
       "20     United-States   >50K  \n",
       "21     United-States  <=50K  \n",
       "22     United-States  <=50K  \n",
       "23     United-States  <=50K  \n",
       "24     United-States  <=50K  \n",
       "25     United-States   >50K  \n",
       "26     United-States  <=50K  \n",
       "27             South   >50K  \n",
       "28     United-States  <=50K  \n",
       "29     United-States  <=50K  \n",
       "...              ...    ...  \n",
       "48812  United-States  <=50K  \n",
       "48813  United-States   >50K  \n",
       "48814  United-States  <=50K  \n",
       "48815  United-States   >50K  \n",
       "48816  United-States   >50K  \n",
       "48817  United-States  <=50K  \n",
       "48818  United-States  <=50K  \n",
       "48819  United-States  <=50K  \n",
       "48820  United-States  <=50K  \n",
       "48821  United-States  <=50K  \n",
       "48822  United-States   >50K  \n",
       "48823  United-States  <=50K  \n",
       "48824  United-States  <=50K  \n",
       "48825  United-States  <=50K  \n",
       "48826              ?   >50K  \n",
       "48827  United-States  <=50K  \n",
       "48828  United-States  <=50K  \n",
       "48829  United-States  <=50K  \n",
       "48830  United-States  <=50K  \n",
       "48831  United-States  <=50K  \n",
       "48832  United-States  <=50K  \n",
       "48833  United-States  <=50K  \n",
       "48834  United-States  <=50K  \n",
       "48835  United-States  <=50K  \n",
       "48836  United-States  <=50K  \n",
       "48837  United-States  <=50K  \n",
       "48838  United-States  <=50K  \n",
       "48839  United-States  <=50K  \n",
       "48840  United-States  <=50K  \n",
       "48841  United-States   >50K  \n",
       "\n",
       "[48842 rows x 15 columns]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
     "* Split the dataset into data and a target. The target corresponds to the `class` column. For the data, drop the columns `fnlwgt`, `capitalgain`, and `capitalloss`.\n",
     "\n",
     "* 将数据集拆分为数据和目标。目标对应于 `class` 列。对于数据，删除 `fnlwgt`、`capitalgain` 和 `capitalloss` 这几列。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
     "# %load solutions/05_2_solutions.py\n",
     "# Keep the `class` column as the target, and drop it together with the\n",
     "# unused columns to obtain the feature matrix.\n",
     "y = data['class']\n",
     "X = data.drop(columns=['class', 'fnlwgt', 'capitalgain', 'capitalloss'])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>workclass</th>\n",
       "      <th>education</th>\n",
       "      <th>education-num</th>\n",
       "      <th>marital-status</th>\n",
       "      <th>occupation</th>\n",
       "      <th>relationship</th>\n",
       "      <th>race</th>\n",
       "      <th>sex</th>\n",
       "      <th>hoursperweek</th>\n",
       "      <th>native-country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>State-gov</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>11th</td>\n",
       "      <td>7</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Wife</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>2</td>\n",
       "      <td>Cuba</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Wife</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>9th</td>\n",
       "      <td>5</td>\n",
       "      <td>Married-spouse-absent</td>\n",
       "      <td>Other-service</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>Jamaica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>3</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>4</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>1</td>\n",
       "      <td>State-gov</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Asian-Pac-Islander</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>India</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>0</td>\n",
       "      <td>Private</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>1</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>Assoc-acdm</td>\n",
       "      <td>12</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Sales</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Assoc-voc</td>\n",
       "      <td>11</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Asian-Pac-Islander</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>7th-8th</td>\n",
       "      <td>4</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Transport-moving</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Amer-Indian-Eskimo</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>Mexico</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>0</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Farming-fishing</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>1</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Machine-op-inspct</td>\n",
       "      <td>Unmarried</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>11th</td>\n",
       "      <td>7</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Sales</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>2</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Unmarried</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Doctorate</td>\n",
       "      <td>16</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Separated</td>\n",
       "      <td>Other-service</td>\n",
       "      <td>Unmarried</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>1</td>\n",
       "      <td>Federal-gov</td>\n",
       "      <td>9th</td>\n",
       "      <td>5</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Farming-fishing</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>11th</td>\n",
       "      <td>7</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Transport-moving</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>4</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Tech-support</td>\n",
       "      <td>Unmarried</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>4</td>\n",
       "      <td>Local-gov</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Tech-support</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>0</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>3</td>\n",
       "      <td>?</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>?</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Asian-Pac-Islander</td>\n",
       "      <td>Male</td>\n",
       "      <td>3</td>\n",
       "      <td>South</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>4</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48812</th>\n",
       "      <td>4</td>\n",
       "      <td>?</td>\n",
       "      <td>Assoc-voc</td>\n",
       "      <td>11</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>?</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48813</th>\n",
       "      <td>4</td>\n",
       "      <td>Private</td>\n",
       "      <td>Assoc-voc</td>\n",
       "      <td>11</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48814</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Tech-support</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48815</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48816</th>\n",
       "      <td>2</td>\n",
       "      <td>Local-gov</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48817</th>\n",
       "      <td>0</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Machine-op-inspct</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48818</th>\n",
       "      <td>4</td>\n",
       "      <td>Self-emp-inc</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48819</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Protective-serv</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48820</th>\n",
       "      <td>4</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Widowed</td>\n",
       "      <td>Sales</td>\n",
       "      <td>Other-relative</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48821</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Sales</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48822</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Prof-school</td>\n",
       "      <td>15</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48823</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48824</th>\n",
       "      <td>0</td>\n",
       "      <td>Private</td>\n",
       "      <td>Some-college</td>\n",
       "      <td>10</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48825</th>\n",
       "      <td>4</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Widowed</td>\n",
       "      <td>Farming-fishing</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>1</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48826</th>\n",
       "      <td>3</td>\n",
       "      <td>Local-gov</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Wife</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48827</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Separated</td>\n",
       "      <td>Priv-house-serv</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>1</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48828</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Assoc-voc</td>\n",
       "      <td>11</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Unmarried</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48829</th>\n",
       "      <td>4</td>\n",
       "      <td>Private</td>\n",
       "      <td>Assoc-acdm</td>\n",
       "      <td>12</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48830</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48831</th>\n",
       "      <td>3</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48832</th>\n",
       "      <td>4</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Sales</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48833</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Craft-repair</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48834</th>\n",
       "      <td>0</td>\n",
       "      <td>Private</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Other-service</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48835</th>\n",
       "      <td>3</td>\n",
       "      <td>Local-gov</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Other-service</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48836</th>\n",
       "      <td>1</td>\n",
       "      <td>Private</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48837</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48838</th>\n",
       "      <td>4</td>\n",
       "      <td>?</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Widowed</td>\n",
       "      <td>?</td>\n",
       "      <td>Other-relative</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48839</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48840</th>\n",
       "      <td>2</td>\n",
       "      <td>Private</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Own-child</td>\n",
       "      <td>Asian-Pac-Islander</td>\n",
       "      <td>Male</td>\n",
       "      <td>2</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48841</th>\n",
       "      <td>1</td>\n",
       "      <td>Self-emp-inc</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>3</td>\n",
       "      <td>United-States</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>48842 rows × 11 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       age         workclass     education  education-num  \\\n",
       "0        2         State-gov     Bachelors             13   \n",
       "1        3  Self-emp-not-inc     Bachelors             13   \n",
       "2        2           Private       HS-grad              9   \n",
       "3        3           Private          11th              7   \n",
       "4        1           Private     Bachelors             13   \n",
       "5        2           Private       Masters             14   \n",
       "6        3           Private           9th              5   \n",
       "7        3  Self-emp-not-inc       HS-grad              9   \n",
       "8        1           Private       Masters             14   \n",
       "9        2           Private     Bachelors             13   \n",
       "10       2           Private  Some-college             10   \n",
       "11       1         State-gov     Bachelors             13   \n",
       "12       0           Private     Bachelors             13   \n",
       "13       1           Private    Assoc-acdm             12   \n",
       "14       2           Private     Assoc-voc             11   \n",
       "15       1           Private       7th-8th              4   \n",
       "16       0  Self-emp-not-inc       HS-grad              9   \n",
       "17       1           Private       HS-grad              9   \n",
       "18       2           Private          11th              7   \n",
       "19       2  Self-emp-not-inc       Masters             14   \n",
       "20       2           Private     Doctorate             16   \n",
       "21       3           Private       HS-grad              9   \n",
       "22       1       Federal-gov           9th              5   \n",
       "23       2           Private          11th              7   \n",
       "24       4           Private       HS-grad              9   \n",
       "25       4         Local-gov     Bachelors             13   \n",
       "26       0           Private       HS-grad              9   \n",
       "27       3                 ?  Some-college             10   \n",
       "28       2           Private       HS-grad              9   \n",
       "29       3           Private       HS-grad              9   \n",
       "...    ...               ...           ...            ...   \n",
       "48812    4                 ?     Assoc-voc             11   \n",
       "48813    4           Private     Assoc-voc             11   \n",
       "48814    2           Private     Bachelors             13   \n",
       "48815    2           Private       Masters             14   \n",
       "48816    2         Local-gov       Masters             14   \n",
       "48817    0           Private       HS-grad              9   \n",
       "48818    4      Self-emp-inc  Some-college             10   \n",
       "48819    1           Private  Some-college             10   \n",
       "48820    4           Private       HS-grad              9   \n",
       "48821    1           Private  Some-college             10   \n",
       "48822    2           Private   Prof-school             15   \n",
       "48823    3           Private       HS-grad              9   \n",
       "48824    0           Private  Some-college             10   \n",
       "48825    4  Self-emp-not-inc       HS-grad              9   \n",
       "48826    3         Local-gov       Masters             14   \n",
       "48827    3           Private       HS-grad              9   \n",
       "48828    2           Private     Assoc-voc             11   \n",
       "48829    4           Private    Assoc-acdm             12   \n",
       "48830    1           Private       HS-grad              9   \n",
       "48831    3           Private       HS-grad              9   \n",
       "48832    4           Private       HS-grad              9   \n",
       "48833    1           Private       HS-grad              9   \n",
       "48834    0           Private       HS-grad              9   \n",
       "48835    3         Local-gov       Masters             14   \n",
       "48836    1           Private     Bachelors             13   \n",
       "48837    2           Private     Bachelors             13   \n",
       "48838    4                 ?       HS-grad              9   \n",
       "48839    2           Private     Bachelors             13   \n",
       "48840    2           Private     Bachelors             13   \n",
       "48841    1      Self-emp-inc     Bachelors             13   \n",
       "\n",
       "              marital-status         occupation    relationship  \\\n",
       "0              Never-married       Adm-clerical   Not-in-family   \n",
       "1         Married-civ-spouse    Exec-managerial         Husband   \n",
       "2                   Divorced  Handlers-cleaners   Not-in-family   \n",
       "3         Married-civ-spouse  Handlers-cleaners         Husband   \n",
       "4         Married-civ-spouse     Prof-specialty            Wife   \n",
       "5         Married-civ-spouse    Exec-managerial            Wife   \n",
       "6      Married-spouse-absent      Other-service   Not-in-family   \n",
       "7         Married-civ-spouse    Exec-managerial         Husband   \n",
       "8              Never-married     Prof-specialty   Not-in-family   \n",
       "9         Married-civ-spouse    Exec-managerial         Husband   \n",
       "10        Married-civ-spouse    Exec-managerial         Husband   \n",
       "11        Married-civ-spouse     Prof-specialty         Husband   \n",
       "12             Never-married       Adm-clerical       Own-child   \n",
       "13             Never-married              Sales   Not-in-family   \n",
       "14        Married-civ-spouse       Craft-repair         Husband   \n",
       "15        Married-civ-spouse   Transport-moving         Husband   \n",
       "16             Never-married    Farming-fishing       Own-child   \n",
       "17             Never-married  Machine-op-inspct       Unmarried   \n",
       "18        Married-civ-spouse              Sales         Husband   \n",
       "19                  Divorced    Exec-managerial       Unmarried   \n",
       "20        Married-civ-spouse     Prof-specialty         Husband   \n",
       "21                 Separated      Other-service       Unmarried   \n",
       "22        Married-civ-spouse    Farming-fishing         Husband   \n",
       "23        Married-civ-spouse   Transport-moving         Husband   \n",
       "24                  Divorced       Tech-support       Unmarried   \n",
       "25        Married-civ-spouse       Tech-support         Husband   \n",
       "26             Never-married       Craft-repair       Own-child   \n",
       "27        Married-civ-spouse                  ?         Husband   \n",
       "28                  Divorced    Exec-managerial   Not-in-family   \n",
       "29        Married-civ-spouse       Craft-repair         Husband   \n",
       "...                      ...                ...             ...   \n",
       "48812     Married-civ-spouse                  ?         Husband   \n",
       "48813     Married-civ-spouse     Prof-specialty         Husband   \n",
       "48814          Never-married       Tech-support   Not-in-family   \n",
       "48815     Married-civ-spouse     Prof-specialty         Husband   \n",
       "48816     Married-civ-spouse    Exec-managerial         Husband   \n",
       "48817          Never-married  Machine-op-inspct       Own-child   \n",
       "48818               Divorced    Exec-managerial   Not-in-family   \n",
       "48819     Married-civ-spouse    Protective-serv         Husband   \n",
       "48820                Widowed              Sales  Other-relative   \n",
       "48821          Never-married              Sales   Not-in-family   \n",
       "48822     Married-civ-spouse     Prof-specialty         Husband   \n",
       "48823     Married-civ-spouse       Craft-repair         Husband   \n",
       "48824          Never-married       Craft-repair       Own-child   \n",
       "48825                Widowed    Farming-fishing   Not-in-family   \n",
       "48826     Married-civ-spouse     Prof-specialty            Wife   \n",
       "48827              Separated    Priv-house-serv   Not-in-family   \n",
       "48828          Never-married       Adm-clerical       Unmarried   \n",
       "48829               Divorced     Prof-specialty   Not-in-family   \n",
       "48830     Married-civ-spouse  Handlers-cleaners         Husband   \n",
       "48831     Married-civ-spouse       Adm-clerical         Husband   \n",
       "48832     Married-civ-spouse              Sales         Husband   \n",
       "48833     Married-civ-spouse       Craft-repair         Husband   \n",
       "48834          Never-married      Other-service       Own-child   \n",
       "48835               Divorced      Other-service   Not-in-family   \n",
       "48836          Never-married     Prof-specialty       Own-child   \n",
       "48837               Divorced     Prof-specialty   Not-in-family   \n",
       "48838                Widowed                  ?  Other-relative   \n",
       "48839     Married-civ-spouse     Prof-specialty         Husband   \n",
       "48840               Divorced       Adm-clerical       Own-child   \n",
       "48841     Married-civ-spouse    Exec-managerial         Husband   \n",
       "\n",
       "                     race     sex  hoursperweek native-country  \n",
       "0                   White    Male             2  United-States  \n",
       "1                   White    Male             0  United-States  \n",
       "2                   White    Male             2  United-States  \n",
       "3                   Black    Male             2  United-States  \n",
       "4                   Black  Female             2           Cuba  \n",
       "5                   White  Female             2  United-States  \n",
       "6                   Black  Female             0        Jamaica  \n",
       "7                   White    Male             2  United-States  \n",
       "8                   White  Female             3  United-States  \n",
       "9                   White    Male             2  United-States  \n",
       "10                  Black    Male             4  United-States  \n",
       "11     Asian-Pac-Islander    Male             2          India  \n",
       "12                  White  Female             1  United-States  \n",
       "13                  Black    Male             3  United-States  \n",
       "14     Asian-Pac-Islander    Male             2              ?  \n",
       "15     Amer-Indian-Eskimo    Male             2         Mexico  \n",
       "16                  White    Male             1  United-States  \n",
       "17                  White    Male             2  United-States  \n",
       "18                  White    Male             3  United-States  \n",
       "19                  White  Female             2  United-States  \n",
       "20                  White    Male             3  United-States  \n",
       "21                  Black  Female             0  United-States  \n",
       "22                  Black    Male             2  United-States  \n",
       "23                  White    Male             2  United-States  \n",
       "24                  White  Female             2  United-States  \n",
       "25                  White    Male             2  United-States  \n",
       "26                  White    Male             2  United-States  \n",
       "27     Asian-Pac-Islander    Male             3          South  \n",
       "28                  White    Male             4  United-States  \n",
       "29                  White    Male             2  United-States  \n",
       "...                   ...     ...           ...            ...  \n",
       "48812               White    Male             0  United-States  \n",
       "48813               White    Male             2  United-States  \n",
       "48814               White  Female             2  United-States  \n",
       "48815               White    Male             3  United-States  \n",
       "48816               White    Male             3  United-States  \n",
       "48817               White    Male             2  United-States  \n",
       "48818               White  Female             2  United-States  \n",
       "48819               White    Male             2  United-States  \n",
       "48820               White  Female             0  United-States  \n",
       "48821               White  Female             2  United-States  \n",
       "48822               White    Male             3  United-States  \n",
       "48823               White    Male             2  United-States  \n",
       "48824               White    Male             2  United-States  \n",
       "48825               White    Male             1  United-States  \n",
       "48826               White  Female             2              ?  \n",
       "48827               White  Female             1  United-States  \n",
       "48828               Black  Female             2  United-States  \n",
       "48829               White    Male             2  United-States  \n",
       "48830               White    Male             2  United-States  \n",
       "48831               White    Male             2  United-States  \n",
       "48832               White    Male             3  United-States  \n",
       "48833               White    Male             2  United-States  \n",
       "48834               White  Female             2  United-States  \n",
       "48835               White    Male             2  United-States  \n",
       "48836               White    Male             2  United-States  \n",
       "48837               White  Female             2  United-States  \n",
       "48838               Black    Male             2  United-States  \n",
       "48839               White    Male             3  United-States  \n",
       "48840  Asian-Pac-Islander    Male             2  United-States  \n",
       "48841               White    Male             3  United-States  \n",
       "\n",
       "[48842 rows x 11 columns]"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        <=50K\n",
       "1        <=50K\n",
       "2        <=50K\n",
       "3        <=50K\n",
       "4        <=50K\n",
       "5        <=50K\n",
       "6        <=50K\n",
       "7         >50K\n",
       "8         >50K\n",
       "9         >50K\n",
       "10        >50K\n",
       "11        >50K\n",
       "12       <=50K\n",
       "13       <=50K\n",
       "14        >50K\n",
       "15       <=50K\n",
       "16       <=50K\n",
       "17       <=50K\n",
       "18       <=50K\n",
       "19        >50K\n",
       "20        >50K\n",
       "21       <=50K\n",
       "22       <=50K\n",
       "23       <=50K\n",
       "24       <=50K\n",
       "25        >50K\n",
       "26       <=50K\n",
       "27        >50K\n",
       "28       <=50K\n",
       "29       <=50K\n",
       "         ...  \n",
       "48812    <=50K\n",
       "48813     >50K\n",
       "48814    <=50K\n",
       "48815     >50K\n",
       "48816     >50K\n",
       "48817    <=50K\n",
       "48818    <=50K\n",
       "48819    <=50K\n",
       "48820    <=50K\n",
       "48821    <=50K\n",
       "48822     >50K\n",
       "48823    <=50K\n",
       "48824    <=50K\n",
       "48825    <=50K\n",
       "48826     >50K\n",
       "48827    <=50K\n",
       "48828    <=50K\n",
       "48829    <=50K\n",
       "48830    <=50K\n",
       "48831    <=50K\n",
       "48832    <=50K\n",
       "48833    <=50K\n",
       "48834    <=50K\n",
       "48835    <=50K\n",
       "48836    <=50K\n",
       "48837    <=50K\n",
       "48838    <=50K\n",
       "48839    <=50K\n",
       "48840    <=50K\n",
       "48841     >50K\n",
       "Name: class, Length: 48842, dtype: object"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The target is not encoded. Use the `sklearn.preprocessing.LabelEncoder` to encode the class.\n",
    "\n",
    "* 目标未编码。使用`sklearn.preprocessing.LabelEncoder`对类进行编码。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/05_3_solutions.py\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "encoder = LabelEncoder()\n",
    "y = encoder.fit_transform(y)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, ..., 0, 0, 1])"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y"
   ]
  },
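  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sketch (on a hypothetical three-label sample, not the dataset above), a fitted `LabelEncoder` exposes the learned mapping through `classes_` and can reverse the encoding with `inverse_transform`:\n",
    "\n",
    "```python\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "enc = LabelEncoder()\n",
    "codes = enc.fit_transform(['<=50K', '>50K', '<=50K'])\n",
    "print(enc.classes_)                    # sorted unique labels\n",
    "print(codes)                           # one integer code per sample\n",
    "print(enc.inverse_transform(codes))    # back to the original strings\n",
    "```"
   ]
  },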
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Create a list containing the name of the categorical columns. Similarly, do the same for the numerical data.\n",
    "\n",
    "* 创建一个包含分类列名称的列表。 同样，对数值数据也一样。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/05_4_solutions.py\n",
    "col_cat = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'native-country', 'sex']\n",
    "col_num = ['age', 'hoursperweek']\n"
   ]
  },
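  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an alternative sketch (shown on a small hypothetical frame, not the full dataset), the two lists can also be derived from the column dtypes with `select_dtypes` instead of being typed by hand:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame({'age': [25, 40], 'workclass': ['Private', 'State-gov']})\n",
    "col_cat = df.select_dtypes(include='object').columns.tolist()\n",
    "col_num = df.select_dtypes(include='number').columns.tolist()\n",
    "print(col_cat, col_num)\n",
    "```"
   ]
  },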
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Create a pipeline to one-hot encode the categorical data. Use the `KBinsDiscretizer` for the numerical data. Import it from `sklearn.preprocessing`.\n",
    "* 创建一个管道以对分类数据进行读热编码。 使用`KBinsDiscretizer`作为数值数据。 从`sklearn.preprocessing`导入它。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/05_5_solutions.py\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "from sklearn.preprocessing import KBinsDiscretizer\n",
    "\n",
    "pipe_cat = OneHotEncoder(handle_unknown='ignore')\n",
    "pipe_num = KBinsDiscretizer()\n"
   ]
  },
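  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what `KBinsDiscretizer` does, here is a minimal sketch on a hypothetical age column (`encode='ordinal'` is used only to make the bin indices readable; the default output is one-hot):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import KBinsDiscretizer\n",
    "\n",
    "ages = np.array([[18], [30], [45], [60]])\n",
    "disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')\n",
    "binned = disc.fit_transform(ages)   # uniform bin edges: 18, 32, 46, 60\n",
    "print(binned.ravel())\n",
    "```"
   ]
  },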
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Create a `preprocessor` by using the `make_column_transformer`. You should apply the good pipeline to the good column.\n",
    "* 使用`make_column_transformer`创建预处理器。 您应该将好的管道应用于好的列。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\compose\\_column_transformer.py:739: DeprecationWarning: `make_column_transformer` now expects (transformer, columns) as input tuples instead of (columns, transformer). This has been introduced in v0.20.1. `make_column_transformer` will stop accepting the deprecated (columns, transformer) order in v0.22.\n",
      "  warnings.warn(message, DeprecationWarning)\n"
     ]
    }
   ],
   "source": [
    "# %load solutions/05_6_solutions.py\n",
    "from sklearn.compose import make_column_transformer\n",
    "preprocessor = make_column_transformer((col_cat, pipe_cat, ), (col_num, pipe_num))\n"
   ]
  },
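  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On a small hypothetical frame, the resulting transformer routes each tuple to its columns and stacks the outputs side by side (here, two one-hot columns plus one binned column):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from sklearn.compose import make_column_transformer\n",
    "from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder\n",
    "\n",
    "df = pd.DataFrame({'age': [20, 30, 40, 50],\n",
    "                   'sex': ['Male', 'Female', 'Male', 'Female']})\n",
    "ct = make_column_transformer(\n",
    "    (OneHotEncoder(), ['sex']),\n",
    "    (KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform'), ['age']))\n",
    "print(ct.fit_transform(df).shape)\n",
    "```"
   ]
  },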
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Pipeline the preprocessor with a `LogisticRegression` classifier. Subsequently define a grid-search to find the best parameter `C`. Train and test this workflow in a cross-validation scheme using `cross_validate`.\n",
    "\n",
    "* 使用`LogisticRegression`分类器对预处理器进行管道传输。 随后定义网格搜索以找到最佳参数`C`.使用`cross_validate`在交叉验证方案中训练和测试此工作流程。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x256366cbe10>"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYQAAAD9CAYAAAC85wBuAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFPZJREFUeJzt3X+wXOV93/H3R+JHCDIgB3xrA0VKQqipiSHcIWFS3Ju6uDik4NSZWII4pXZHMxGyJ3hqWx63hB+esTwxg39Ak8iJI5ofqBQ7LrFkwA1aMpNQW+Dym0BkGaIfbWpnjOMLtID67R/73GRZ7dXdvbrSvRLv18wZ3ec5z3n22aOz+9lzzu45qSokSVo03wOQJC0MBoIkCTAQJEmNgSBJAgwESVJjIEiSAANBktQYCJIkwECQJDVHzPcARnHiiSfWsmXL5nsYh4XnnnuOY489dr6HIQ3k9jm3Hnjgge9U1UkztTukAmHZsmXcf//98z2Mw0Kn02FiYmK+hyEN5PY5t5I8M0w7DxlJkgADQZLUGAiSJMBAkCQ1BoIkCTAQJEmNgSBJAgwESVJzSP0wTaNLMqvlvNe29OrjHsJhrqoGTqd9+MvTzjMMpFcnA0GSBBgIkqTGQJAkAQaCJKkxECRJgIEgSWoMBEkSYCBIkhoDQZIEGAiSpMZAkCQBBoIkqTEQJEmAgSBJagwESRJgIEiSGgNBkgR4C01J82g2t3j1jn4HzlB7CEkuSvJkkm1J1g6Yf2OSB9v0VJJne+bt6Zl3R0/98iRfS/KXSf5zkqPm5ilJOlTM5havOnBmDIQki4GbgbcDZwIrk5zZ26aqrqqqs6vqbOCzwBd7Zr8wNa+qLump/wRwY1WdDnwXeO9+PhdJ0n4YZg/hPGBbVW2vqheBjcCl+2i/Erh1Xx2mu5/4z4DbW9UtwDuGGIsk6QAZJhBOBnb0lHe2ur0kOQ1YDtzTU/0DSe5P8t+TTL3p/xDwbFW9PFOfkqSDY5iTyoPO+kx3IG8FcHtV7emp+4dVtTvJDwP3JHkE+Nth+0yyClgFMDY2RqfTGWLIGobrUguZ2+fBN0wg7ARO7SmfAuyepu0K4Mreiqra3f7dnqQDnAN8ATghyRFtL2HaPqtqPbAeYHx8vCYmJoYYsmZ05yZcl1qw3D7nxTCHjLYCp7dvBR1F903/jv5GSc4AlgL39dQtTXJ0+/tE4KeBx6v7VYEtwC+0pv8a+K/780QkSftnxkBon+DXAHcBTwC3VdVjSa5L0vutoZXAxnrl98LeCNyf5CG6AbCuqh5v8z4MfCDJNrrnFH5n/5+OJGm2hvphWlVtBjb31V3dV75mwHJ/Dpw1TZ/b6X6DSZK0AHjpCkkSYCBIkhqvZXSYePO1d/O9F14aaZllazeN1P74Y47koV9720jLSDp0GAiHie+98BJPr7t46PadTmfkr/WNGiCSDi0eMpIkAQaCJKkxECRJgIEgSWoMBEkSYCBIkhoDQZIEGAiSpMZAkCQBBoIkqTEQJEmAgSBJagwESRJgIEiSGgNBkgQYCJKkxkCQJAEGgiSpMRAkSYCBIElqDARJEmAgSJIaA0GSBAwZCEkuSvJkkm1J1g6Yf2OSB9v0VJJn++Yfl2RXkpt66lYmeSTJw0nuTHLi/j8dSdJszRgISRYDNwNvB84EViY5s7dNVV1VVWdX1dnAZ4Ev9nVzPXBvT59HAJ8Gfqaqfhx4GFizP09EkrR/htlDOA/YVlXbq+pFYCNw6T7arwRunSokORcYA+7uaZM2HZskwHHA7hHHLkmaQ8MEwsnAjp7yzla3lySnAcuBe1p5EXAD8MHedlX1EvArwCN0g+BM4HdGHLskaQ4dMUSbDKiradquAG6vqj2tvBrYXFU7ujsCrcPkSLqBcA6wne5hpo8AH9vrwZNVwCqAsbExOp3OEEN+dRpl3UxOTs5qXbr+dbC4rR18wwTCTuDU
nvIpTH94ZwVwZU/5fOCCJKuBJcBRSSaBLwBU1TcBktwG7HWyurVZD6wHGB8fr4mJiSGG/Cp05yZGWTedTmek9rN5DAngzdfezfdeeGnk5a6487mh2x5/zJE89GtvG/kx9ErDBMJW4PQky4FddN/0L+tvlOQMYClw31RdVV3eM/8KYLyq1iZ5A3BmkpOq6tvAhcAT+/NEJC1M33vhJZ5ed/FIy4z6gWXZ2k0jjkqDzBgIVfVykjXAXcBi4PNV9ViS64D7q+qO1nQlsLGqpjuc1Nvn7iTXAn+a5CXgGeCK2T4JwWveuJazbhm4kzW9W0Z9DIDRXtiSDh3D7CFQVZuBzX11V/eVr5mhjw3Ahp7ybwK/OdwwNZPvP7FupE9hszlk5Kcw6fDmL5UlSYCBIElqDARJEmAgSJIaA0GSBBgIkqTGQJAkAQaCJKkZ6odpOjSM/MOxO0drf/wxR47Wv6RDioFwmBj1WjHL1m4aeRlJhzcPGUmSAANBktQYCJIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMgSJIAf6l82Esy/bxPTL9cVR2A0UhayNxDOMxV1cBpy5Yt084zDKRXJwNBkgQYCJKkxkCQJAEGgiSpMRAkSYCBIElqhgqEJBcleTLJtiRrB8y/McmDbXoqybN9849LsivJTT11RyVZ39r/RZJ37v/TkSTN1ow/TEuyGLgZuBDYCWxNckdVPT7Vpqqu6mn/PuCcvm6uB+7tq/so8L+r6seSLAJeO7unIEmaC8PsIZwHbKuq7VX1IrARuHQf7VcCt04VkpwLjAF397V7D/BxgKr6f1X1nVEGLkmaW8NcuuJkYEdPeSfwk4MaJjkNWA7c08qLgBuAdwNv7Wl3Qvvz+iQTwDeBNVX11wP6XAWsAhgbG6PT6QwxZM1kcnLSdamDZtRtbTbbp9vz/hsmEAZdDGe6axusAG6vqj2tvBrYXFU7+q6pcwRwCvBnVfWBJB8APkk3OF75QFXrgfUA4+PjNTExMcSQNZNOp4PrUgfFnZtG3tZG3j5n8Rja2zCBsBM4tad8CrB7mrYrgCt7yucDFyRZDSwBjkoyCXwEeB74o9buvwDvHWHckqQ5NkwgbAVOT7Ic2EX3Tf+y/kZJzgCWAvdN1VXV5T3zrwDGq2ptK/8xMEH38NJbgceRJM2bGQOhql5Osga4C1gMfL6qHktyHXB/Vd3Rmq4ENtbwl8r8MPB7ST4FfBv4N6MPX5I0V4a6H0JVbQY299Vd3Ve+ZoY+NgAbesrPAG8ZbpiSpAPNXypLkgADQZLUeAtNSQfUa964lrNu2euKNzO7ZZTHALh49MfQKxgIkg6o7z+xjqfXjfZmPervEJat3TTiqDSIh4wkSYCBIElqDARJEmAgSJIaA0GSBBgIkqTGQJAkAQaCJKkxECRJgIEgSWoMBEkSYCBIkhoDQZIEGAiSpMZAkCQBBoIkqTEQJEmAgSBJagwESRJgIEiSGgNBkgQYCJKkZqhASHJRkieTbEuydsD8G5M82KankjzbN/+4JLuS3DRg2TuSPDr7pyBJmgtHzNQgyWLgZuBCYCewNckdVfX4VJuquqqn/fuAc/q6uR64d0Df/wqYnN3QJUlzaZg9hPOAbVW1vapeBDYCl+6j/Urg1qlCknOBMeDu3kZJlgAfAD426qAlSXNvmEA4GdjRU97Z6vaS5DRgOXBPKy8CbgA+OKD59W3e8yOMV5J0gMx4yAjIgLqapu0K4Paq2tPKq4HNVbUj+ftukpwN/GhVXZVk2T4fPFkFrAIYGxuj0+kMMWTNZHJy0nWpg2bUbW0226fb8/4bJhB2Aqf2lE8Bdk/TdgVwZU/5fOCCJKuBJcBRSSaBZ4BzkzzdxvC6JJ2qmujvsKrWA+sBxsfHa2JiryaahU6ng+tSB8Wdm0be1kbePmfxGNrbMIGwFTg9yXJgF903/cv6GyU5A1gK3DdVV1WX98y/AhivqqlvKf1Gq18GfHlQGEiSDp4ZzyFU1cvA
GuAu4Angtqp6LMl1SS7paboS2FhV0x1OkiQtYMPsIVBVm4HNfXVX95WvmaGPDcCGAfVPA28aZhySpAPHXypLkgADQZLUGAiSJMBAkCQ1BoIkCTAQJEmNgSBJAgwESVJjIEiSAANBktQYCJIkwECQJDUGgiQJMBAkSc1Ql7+WpP2xbO2m0Re6c/hljj/myNH7114MBEkH1NPrLh55mWVrN81qOe0fDxlJkgADQZLUGAiSJMBAkCQ1BoIkCTAQJEmNgSBJAgwESVJjIEiSAANBktQYCJIkYMhASHJRkieTbEuydsD8G5M82KankjzbN/+4JLuS3NTKP5hkU5K/SPJYknVz83QkSbM148XtkiwGbgYuBHYCW5PcUVWPT7Wpqqt62r8POKevm+uBe/vqPllVW5IcBfxJkrdX1Vdm+TwkSftpmD2E84BtVbW9ql4ENgKX7qP9SuDWqUKSc4Ex4O6puqp6vqq2tL9fBL4BnDL68CVJc2WYy1+fDOzoKe8EfnJQwySnAcuBe1p5EXAD8G7grdMscwLwL4FPTzN/FbAKYGxsjE6nM8SQNZPJyUnXpRY0t8+Db5hAyIC6mqbtCuD2qtrTyquBzVW1I9m7myRH0N2b+ExVbR/UYVWtB9YDjI+P18TExBBD1kw6nQ6uSy1Yd25y+5wHwwTCTuDUnvIpwO5p2q4Aruwpnw9ckGQ1sAQ4KslkVU2dmF4P/GVVfWq0YUuS5towgbAVOD3JcmAX3Tf9y/obJTkDWArcN1VXVZf3zL8CGJ8KgyQfA44H/u1+jF+SNEdmPKlcVS8Da4C7gCeA26rqsSTXJbmkp+lKYGNVTXc46e8kOQX4KHAm8I32dVWDQZLm0VD3VK6qzcDmvrqr+8rXzNDHBmBD+3sng89NSJLmib9UliQBBoIkqTEQJEmAgSBJagwESRJgIEiSGgNBkgQYCJKkxkCQJAEGgiSpMRAkSYCBIElqDARJEmAgSJIaA0GSBBgIkqTGQJAkAQaCJKkxECRJgIEgSWoMBEkSYCBIkhoDQZIEGAiSpMZAkCQBBoIkqRkqEJJclOTJJNuSrB0w/8YkD7bpqSTP9s0/LsmuJDf11J2b5JHW52eSZP+fjqRDSZKB0zOf+Llp5+nAmTEQkiwGbgbeDpwJrExyZm+bqrqqqs6uqrOBzwJf7OvmeuDevrrfAFYBp7fpolk9A0mHrKoaOG3ZsmXaeTpwhtlDOA/YVlXbq+pFYCNw6T7arwRunSokORcYA+7uqXs9cFxV3Vfd/+H/BLxjFuOXJM2RYQLhZGBHT3lnq9tLktOA5cA9rbwIuAH44IA+dw7TpyTp4DhiiDaDDtpNt9+2Ari9qva08mpgc1Xt6Dv2N3SfSVbRPbTE2NgYnU5niCFrJpOTk65LLVhun/NjmEDYCZzaUz4F2D1N2xXAlT3l84ELkqwGlgBHJZkEPt36mbHPqloPrAcYHx+viYmJIYasmXQ6HVyXWqjcPufHMIGwFTg9yXJgF903/cv6GyU5A1gK3DdVV1WX98y/AhivqrWt/P0kPwV8DfhluiejJUnzZMZzCFX1MrAGuAt4Aritqh5Lcl2SS3qargQ21vBfA/gV4LeBbcA3ga+MNHJJ0pwaZg+BqtoMbO6ru7qvfM0MfWwANvSU7wfeNNwwJUkHmr9UliQBkEPphx5Jvg08M9/jOEycCHxnvgchTcPtc26dVlUnzdTokAoEzZ0k91fV+HyPQxrE7XN+eMhIkgQYCJKkxkB49Vo/3wOQ9sHtcx54DkGSBLiHIElqDARJQ0tyQrs22WyW/dUkPzjXY9LcMRAWkNm+2JJsTnLCgRiT1OcEulcxno1fBQ5aILSbe2kEBsLCMvDFNtOGXVU/W1XP7qvNweAL8FVhHfAj7Xa5v57kg0m2Jnk4ybUASY5NsinJQ0keTfKuJO8H3gBsSbJlUMdJFifZ0JZ5JMlVrf5Hk/y31t83kvxIun69p+27WtuJJFuS/CHwSKv7pSRfb2P+
LbfTfZjuNnVOB3+ieze6F4AH6V5ldgvwh8Djbf6XgAeAx4BVPcs9TfeXncvoXoDwc63N3cAx+3i89wOPAw/TvTAhdC9T/rt0X0wPA+9s9Stb3aPAJ3r6mASuo3vV2n8CnEv3dqkP0L0g4uvne706zek2ugx4tP39NrrfBgrdD5dfBt4CvBP4XM8yx7d/nwZO3Eff5wJf7Smf0P79GvDz7e8foLuX8U7gq8Biundk/Cvg9cAE8BywvLV/I/DHwJGt/B+BX57v9bhQp3kfgFPPf8YrX2yv2LBb3Wvbv8e0N+YfauXeQHgZOLvV3wb80j4ebzdwdPt76sX3CeBTPW2W0v1k91fASXQviHgP8I42v4BfbH8fCfw5cFIrvwv4/HyvV6cDto1+sm17D7ZpG/Be4MeAb7Vt6YKeZWcKhKV0r3z8Wbr3WF8EvAbYOaDtjcB7esq/B1zSXjdbeurXtO18aoxPAtfM93pcqNNQVzvVvPl6VX2rp/z+JD/f/j4VOB34m75lvlVVD7a/H6D7Ap7Ow8AfJPkS3b0PgH9O954XAFTVd5O8BehU1bcBkvwB3U+CXwL2AF9ozc+gewXbr7Y75C0G/udwT1WHoAAfr6rf2mtG917qPwt8PMndVXXdTJ21be3NwL+ge6OtX6R73mG6x57Oc33tbqmqj8z0+PIcwkL3dxt2kgm6b9bnV9Wbgf9Bd/e53//t+XsP+77E+cXAzXR31R9IcgTdF1D/j1P29eL7P/X3t0wN8FhVnd2ms6rqbftYVoee79P91A7dQ4LvSbIEIMnJSV6X5A3A81X1+3T3In5iwLJ7SXIisKiqvgD8B+AnqupvgZ1J3tHaHN2+qfSnwLvaeYeT6H5A+fqAbv8E+IUkr2vLv7bd+10DGAgLy75eMMcD362q55P8I+Cn9ueBkiwCTq2qLcCH6J7QXkL3vMOannZL6R7D/adJTmwn5FbSPU/Q70ngpCTnt2WPTPKP92ecWliq6m+AP0vyKHAh3XNc9yV5BLid7vZ7FvD1JA8CHwU+1hZfD3xlupPKwMlApy23AZj6VP9uunvHD9M9JPkPgD+iu4f7EN1DmB+qqv81YLyPA/8euLst/1W65xo0gL9UXmDatyN+nO7J5b+uqp9r9UfTPURzMu2Nl+6x0E6Sp4Fxum/oX66qN7Vl/h2wpAbcvCjJkXRPWh9P95P971fVuvZpb2qvYQ9wbVV9MclldF+gATZX1YdaP5NVtaSn37OBz7R+j6B7PuJzc7iKJB0gBoIkCRjyFpqSNJeSfA04uq/63VX1yHyMR13uIbwKJLkZ+Om+6k9X1e/Ox3gkLUwGgiQJ8FtGkqTGQJAkAQaCJKkxECRJgIEgSWr+P3tYos/Zzk+rAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x25636618ac8>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# %load solutions/05_7_solutions.py\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "pipe = make_pipeline(preprocessor, LogisticRegression(solver='lbfgs', max_iter=1000))\n",
    "param_grid = {'logisticregression__C': [0.1, 1.0, 10]}\n",
    "grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)\n",
    "scores = pd.DataFrame(cross_validate(grid, X, y, scoring='balanced_accuracy', cv=3, n_jobs=-1, return_train_score=True))\n",
    "scores[['train_score', 'test_score']].boxplot(whis=10)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
